Although, in the ideal case, the representations built in the previous stage should be independent of batch effects, our trials showed that it is difficult to recover the corrected expression profiles from the generators G1 and G2 alone. Therefore, in stage II we further use a GAN-based model to closely match the data distributions of the shared cell types across batches and then generate the corrected expression profiles. The basic idea is to transform cells from all other batches into pseudo-cells of one pre-selected “anchor” batch, such that the pseudo-cells are indistinguishable from true cells of the anchor batch. By indistinguishability, we do not mean that each individual pseudo-cell overlaps perfectly with a true cell; rather, we aim to match the distribution of pseudo-cells with the distribution of true cells carrying the same or similar biological content.

We adopt a specialized MNN pair-based strategy to guide the integration, so that only the distributions of cells from the shared cell types between two batches are matched. An MNN pair is defined as a set of two cells, one from each batch, such that each cell is among the k nearest across-batch neighbors of the other [9]. We use the encoder output E(x) from stage I to define MNN pairs, because these representations are supposed to be independent of batch effects, which yields a larger number of MNN pairs than using the original expression vectors, as shown in Fig. 3e. Other MNN-based methods may regard these pairs as anchors and then use a weighted averaging strategy to correct all other cells. One major potential drawback of MNN pairs is that it is hard to ensure that they cover the complete distributions of cells from the shared cell types (Fig. 1d). We therefore develop a novel random walk-based strategy to expand the MNN pair list. As shown in Fig. 1d, suppose cell a1 from batch 1 and cell a2 from batch 2 form an MNN pair. Among the k1 nearest neighbors of a1 within batch 1, we randomly pick one cell b1; the same procedure gives one cell b2 from batch 2. The set composed of b1 and b2 is then regarded as an extended MNN pair, and also as the next seed pair for random walk expansion. This process is repeated m times, and new pairs of this kind are generated for all MNN pairs. We call the pairs obtained from this procedure rwMNN pairs. The generated rwMNN pairs better cover the distributions of matched cell types, which facilitates the training of GANs (Fig. 3f). We argue that adopting rwMNN pairs would also benefit other MNN-based methods (Additional file 1: Fig. S11).
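The MNN detection and random-walk expansion described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance on the stage-I representations and brute-force neighbor search; the function names are ours, and the parameter names (k, k1, m) follow the text.

```python
# Sketch of MNN pair detection and random-walk expansion (rwMNN).
# E1, E2 stand in for batch-1 / batch-2 cell representations from the
# stage-I encoder; here they are random toy data.
import numpy as np

def knn_indices(query, ref, k):
    """Indices of the k nearest rows of `ref` for each row of `query` (Euclidean)."""
    d = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=2)
    return np.argsort(d, axis=1)[:, :k]

def mnn_pairs(E1, E2, k):
    """Mutual nearest-neighbor pairs: (i, j) with j in kNN(E1[i], E2) and vice versa."""
    nn12 = knn_indices(E1, E2, k)   # batch-1 cells' neighbors in batch 2
    nn21 = knn_indices(E2, E1, k)   # batch-2 cells' neighbors in batch 1
    return [(i, j) for i in range(len(E1)) for j in nn12[i] if i in nn21[j]]

def rwmnn_pairs(E1, E2, pairs, k1, m, rng):
    """Expand each MNN pair by m random-walk steps within each batch."""
    nn1 = knn_indices(E1, E1, k1)   # within-batch neighbors (index 0 is the cell itself)
    nn2 = knn_indices(E2, E2, k1)
    expanded = list(pairs)
    for a1, a2 in pairs:
        for _ in range(m):
            a1 = rng.choice(nn1[a1])   # pick a random neighbor b1 of a1 in batch 1
            a2 = rng.choice(nn2[a2])   # pick a random neighbor b2 of a2 in batch 2
            expanded.append((a1, a2))  # (b1, b2) is an rwMNN pair and the next seed
    return expanded

rng = np.random.default_rng(0)
E1 = rng.normal(size=(30, 8))
E2 = rng.normal(size=(30, 8)) + 0.1   # mild shift between batches
pairs = mnn_pairs(E1, E2, k=5)
rw = rwmnn_pairs(E1, E2, pairs, k1=3, m=2, rng=rng)
print(len(pairs), len(rw))
```

Each seed pair yields m additional pairs, so the expanded list has (m + 1) times as many pairs as the original MNN list; in practice an approximate nearest-neighbor index would replace the brute-force search.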

Next, we use these rwMNN pairs, denoted as $\{(x_i^{(1)}, x_i^{(2)})\}_{i=1}^{M}$ (the superscript indexing the batch origin), to train the GAN model. This model is composed of two neural networks: a generator G, mapping a cell expression vector $x^{(1)}$ to a pseudo-cell expression vector $G(x^{(1)})$, and a discriminator D, discriminating the pseudo-cell from the true expression vector $x^{(2)}$. The adversarial loss is:

$$\mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{x^{(2)}}\big[\log D(x^{(2)})\big] + \mathbb{E}_{x^{(1)}}\big[\log\big(1 - D(G(x^{(1)}))\big)\big],$$

where D is trained to maximize this objective and G to minimize it.
After training, all cells, including those not in the rwMNN list, can be transformed by the generator G to obtain batch-effect-corrected expression vectors.
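As a concrete numeric illustration of the adversarial objective and the final transformation, the sketch below substitutes a linear "generator" and a logistic "discriminator" for the neural networks G and D; all weights and cells are random placeholders, not the trained model from the text.

```python
# Toy sketch of the stage-II adversarial objective: a linear generator and a
# logistic discriminator stand in for the networks G and D. All weights and
# cells below are random placeholders for illustration only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
d = 8                                      # number of genes (toy dimension)
W = rng.normal(scale=0.1, size=(d, d))     # toy generator weights: G(x) = x @ W
v = rng.normal(scale=0.1, size=d)          # toy discriminator weights

def G(x):                                  # generator: source-batch cell -> pseudo-cell
    return x @ W

def D(x):                                  # discriminator: P(cell is a true anchor-batch cell)
    return sigmoid(x @ v)

x1 = rng.normal(size=(16, d))              # source-batch members of the rwMNN pairs
x2 = rng.normal(size=(16, d))              # their anchor-batch partners

# Adversarial loss: E[log D(x2)] + E[log(1 - D(G(x1)))];
# D is trained to maximize it, G to minimize it.
L_adv = np.mean(np.log(D(x2))) + np.mean(np.log(1.0 - D(G(x1))))

# After training, every source-batch cell (rwMNN member or not) is mapped
# through G to obtain its batch-effect-corrected expression vector.
X_all = rng.normal(size=(100, d))
X_corrected = G(X_all)
print(L_adv, X_corrected.shape)
```

In the actual model both networks are trained in alternation on this objective; the toy version only evaluates the loss once to show how the rwMNN pairs supply the real and generated samples.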
