The iMAP model was initially inspired by neural style transfer methodologies from the computer vision field [17]. The essential objective of style transfer is to transfer natural images to paints plausibly created by one specific artist, while retaining the underlying image contents. Here we regarded different measuring processes of single-cell transcriptomes as specific “painting styles,” and then batch effect removal could be realized by transforming all cells into the same batch style. GAN-based models are current state-of-art frameworks for style transfer of images [39]. Although hitherto, available style transfer models are designed specialized for images and not suitable for biological datasets.

The most difficult challenge for batch effect removal is to balance the tradeoff between discerning identification of the batch-specific cell types and sufficient mixing of the batch-shared cell types. To overcome this entangled matter, we formalize our iMAP integration model into two stages, with one stage of representation learning and the other stage of batch effect removal of the original expression profiles. Specially, we elaborate a novel autoencoder structure in the first stage, to build representations of effects of biological variations disentangled from measurement noises on single-cell transcriptomes. These representations could already discriminate the batch-specific cell types and roughly mix those shared between batches. Further in the second stage, we can successfully decipher and eliminate the batch effects on the expression profile of each single cell, by virtue of the strong power of GANs for mixing cell distributions from different batches. To make GANs easily capture and match different modes of shared biological variations across batches, we only employ those cells with plausibly similar biological content in the training process to avoid the possibly detrimental mixture of the batch-specific cells and devise a specialized random walk procedure to fully cover the underlying cell type distributions. Details were further explained below.

We modeled the measured expression vector as the coupled effects of true biological variations and inevitable measurement noises. Although the measuring process may have distinctive effects on different cell types, it is reasonable to assume the true biological variations are independent of measuring noises. Considering that distilling the underlying biological contents from transcriptome measures is the critical step to remove the batch effects, we first designed a novel autoencoder structure to build representations of biological variations, which are expected to be disentangled from measuring noises.

Three forward neural networks are deployed in this stage, including one content encoder E, two generators (decoders) G1 and G2 (Fig. 1b). The inputs to the model include the expression vector of one cell denoted as x, and its batch indicator vector b. One-hot encoding strategy is used to indicate the batch of the cell. For instance, in the case of three batches, cells from the first batch have their batch indicator vector b = [1, 0, 0]T. The output of the encoder E is denoted as c = E(x), which is expected to exclusively represent the biological contents of cells, and be ignorant of the measuring process. The neural network G1 is deployed to generate the representation of measurement noise G1(b), since the measurement noise cannot be fully captured by a simple one-hot vector. Another generator G2 is further used to finish the reconstruction of the original expression vector. The inputs to the generator G2 include both E(x) and b, because intuitively, it is possible for the generator to reconstruct the original measured expression vector only if both the biological content and measurement noise are simultaneously provided. The final reconstructed expression vector is G(E(x), b) = f(G1(b) + G2(E(x), b)), where f is a non-linear transformation, and is used to match the range of reconstructed vector with the original expression vector. The ReLU function f(x) = max(0, x) can be the default candidate for non-negative expression vectors. The reconstruction loss ( represents expectation) can be formalized as:

The key to successfully extract biological contents of one cell is to disentangle the biological representation c from the corresponding cell batch indicator b. We achieve this by deliberately generating a random batch indicator vector b~ for each cell, where randomly selected one element is set to 1 while others to 0. Well-trained generators G1 and G2, with E(x) and b~ as inputs, should fabricate one cell with the same content as x. This inspired our content loss as:

In summary, the overall loss function of the first stage is:

where λc and λr are tunable hyperparameters to make tradeoffs between the content and reconstruction loss. In our experiments, this loss function can be optimized at low operating cost, to obtain sufficiently good representations, especially for the identification of the batch-specific cells. However, the overwhelming researches in the field of deep learning have confirmed that it is hard to generate images indistinguishable from true ones by only optimizing the reconstruction loss of autoencoders [40], which inspired us to add the adversarial structures in the stage II, further removing the batch effects from the original expression profiles.

Although in the ideal case, the representations built from the previous stage should be independent of the batch effects, according to our trials, it is hard to retrieve the corrected expression profiles only by the generators G1 and G2. Therefore, we further use a GAN-based model to almost perfectly match the data distributions of the shared cell types across different batches and then generate the corrected expression profiles in the stage II. The basic idea here is to transform cells from all other batches to pseudo-cells of one pre-selected “anchor” batch, and the pseudo-cells are expected to be indistinguishable from true cells of the anchor batch. By indistinguishableness, we do not pursue perfect overlap with true cells for each single pseudo-cell, but endeavor to match the distribution of pseudo-cells with the distribution of true cells with the same or similar biological contents.

We adopt a specialized MNN pair-based strategy to guide the integration, for only matching the distributions of cells from the shared cell types between two batches. An MNN pair is defined as a set of two cells from two batches respectively, that each cell is among the k nearest across-batch neighbors of the other cell [9]. We use the encoder output E(x) from the stage I to define MNN pairs, because these representations are supposed to be batch effect independent, resulting in a larger number of MNN pairs than using the original expression vectors, as we shown in Fig. 3e. Other methods based on MNN pairs may regard these pairs as anchors and then use a weighted averaging strategy to correct all other cells. One major potential drawback of the MNN pairs is that it is hard to assure these pairs could cover the complete distributions of cells from the shared cell types (Fig. 1d). We alternatively develop a novel random walk-based strategy to expand the MNN pair list. As shown in Fig. 1d, suppose cell a1 from batch 1 and cell a2 from batch 2 are selected as an MNN pair. Among the k1 nearest neighbors of a1 from batch 1, we randomly pick one cell b1. The same procedure would give one b2 cell from batch 2. Then, the set composed of b1 and b2 is regarded as an extended MNN pair, and also the next seed pair for random walk expansion. This process is repeated m times. For all MNN pairs, we could generate these kinds of new pairs. We call pairs obtained from this procedure as rwMNN pairs. The generated rwMNN pairs can better cover the distributions of matched cell types, which could facilitate the training of GANs (Fig. 3f). We argue that it is also beneficial to adopt rwMNN pairs for other MNN-based methods (Additional file 1: Fig. S11).

Next, we use those rwMNN pairs, denoted as x1x2ii=1M (the superscript indexing its batch origin) to train the GAN model. This model is composed of two neural networks, one generator G, mapping cell expression vector x(1) to a pseudo-cell expression vector G(x(1)), and one discriminator D, discriminating the pseudo cell from the true expression vector x(2). The adversarial loss is:

After training, all cells including those not in the rwMNN list could be transformed by the generator G to obtain the batch effect removal expression vectors.

We deploy a total of five neural networks. Compared with the network structures, the specific number of neurons for each layer is of less importance for a reasonable number of input dimensions. By default, the encoder E from the first stage is a d → 1024 → 512 → l three-layer (not including the input layer) network (d is the input dimension of expression vectors, and l is the dimension of content representations). The decoder G1 is a n → 512 → 1024 → d three-layer network (n is the number of batches), and the decoder G2 is (n + l) → 512 → 1024 → d. For all networks, the first two layers have a Mish non-linear activation [41], while the last layer is a linear transformation. Two parameters λc = 3, λr = 1 are used to balance the reconstruction loss and content loss. For the second stage, the generator G is a “shortcut connection” inspired by ResNet [42], which means G(x) = f(F(x) + x) (f is a ReLU function), and F itself is an autoencoder structure, d → 1024 → 512 → l → 512 → 1024 → d (all layers are activated by Mish except the middle one). Be default, l is set to 256. The discriminator D is again a three-layer network d → 512 → 512 → 1. To facilitate and stabilize the GAN training process, adversarial losses are optimized via the WGAN-GP [43]. We adopt the Adam optimizer [44] to train the networks, with the learning rate 0.0005 for first stage and 0.0002 for the second.

In the stage II, we need to enquire the kNNs within batch and MNN pairs between batches for cells. This procedure may be compute-intensive. We randomly sample a maximum of s = 3000 cells from each batch to calculate all necessary pairs. Then, a locality sensitive hashing-based Python package “annoy” is adopted to quickly find the approximate nearest neighbors of each cell [45]. These make the time cost of the enquiry process is approximately constant with respect to the number of cells in each batch. The overall time cost depends only on the number of batches and network optimization parameters (such as the number of epochs for training). Hyperparameters used in this stage include k1 = s/100, k = k1/2, m = 50. All hyperparameters can be tunable by the user, although the default options could provide good enough results in most of our tested cases.

In order to deal with multiple datasets, we use an incremental matching manner. The sub-dataset with the largest total variance is selected as the anchor, and all other sub-datasets are processed in the decreasing order of their total variances. Every sub-dataset integrated to the anchor is appended to the anchor. Intuitively, the preferential integration order should arrange those sub-datasets with larger number of cell types firstly. If this information is available, we encourage the users to provide their own anchor and integration ordering. However, we argue that iMAP can also perform well to some extent even if the anchor sub-dataset lacks specific cell types. We demonstrate this in the “panc_rm” dataset, where the “inDrop” batch was selected as the anchor.

All jobs were run on a Linux server with 2x Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz, 256G of DDR4 memory, Nvidia GTX 1080Ti GPU.

The interpretability has gradually become more and more important in the machine learning community, particularly for the applications on biological researches [46]. We adopt SHAP [26], a well-designed game theory-based method to interpret the trained neural networks of iMAP, through grading each gene to evaluate its importance for the outputs of one cell. We provide two kinds of scores to interpreting gene’s importance, each for building representations in the stage I and removing batch effects in the stage II, respectively. Specially, a three-layer neural network is connected to the output of encoder from the stage I, and trained to classify the external cell type information, while the encoder would not be trained further. Then SHAP is used to evaluate the importance of each gene for the classification outputs (Additional file 1: Fig. S1c). The other three-layer neural network is deployed to discriminate the batch-origin of each cell, and SHAP is again used to assign each gene an importance value for the classification of batch (Additional file 1: Fig. S1d). This importance value is regarded as the surrogate for evaluating the importance of one gene in removing batch effects. We expect these two importance scores could offer primary heuristics about the working mechanisms of iMAP, and the roles of genes on representing biological variations and measuring noises.

Preprocessing of scRNA-seq datasets were performed under the standard Scanpy pipelines [47]. Low-quality cells were filtered if the library size or the proportion of mitochondrial gene counts was too large. The input expression vectors for iMAP were log-transformed TPM-like values. Prior to the first stage, we need to select highly variable genes for each batch to help discover the true biological variations from the noisy transcriptomes. Only those genes measured in all batches were considered. The Scanpy API “scanpy.pp.highly_variable_genes” was used to select highly variable genes in each batch respectively, although users could also use their preferred highly variable genes selection method. For the second stage, by default, we also used the selected highly variable genes. It is also possible to deal with the whole transcriptome, as we did for the “CRC” dataset. However, to make our default network structure, which is specially designed for dealing with inputs of highly variable genes (usually about two thousand), suitable to the whole transcriptome (usually about twenty thousand genes), we randomly divided the whole transcriptome into ten parts, each with about two thousand genes, and trained ten separate networks for each of them (Additional file 1: Table S5). We did not take the pre-processing steps into account to measure the time cost shown in Fig. 4d, e.

Note: The content above has been extracted from a research article, so it may not display correctly.

Please log in to submit your questions online.
Your question will be posted on the Bio-101 website. We will send your questions to the authors of this protocol and Bio-protocol community members who are experienced with this method. you will be informed using the email address associated with your Bio-protocol account.

We use cookies on this site to enhance your user experience. By using our website, you are agreeing to allow the storage of cookies on your computer.