We deploy five neural networks in total. Given the network structures, the specific number of neurons in each layer is of minor importance for a reasonable number of input dimensions. By default, the encoder E from the first stage is a d → 1024 → 512 → l three-layer network (not counting the input layer), where d is the input dimension of the expression vectors and l is the dimension of the content representations. The decoder G1 is an n → 512 → 1024 → d three-layer network, where n is the number of batches, and the decoder G2 is (n + l) → 512 → 1024 → d. For all networks, the first two layers use the Mish non-linear activation [41], while the last layer is a linear transformation. Two parameters, λc = 3 and λr = 1, balance the reconstruction loss and the content loss. For the second stage, the generator G uses a “shortcut connection” inspired by ResNet [42], i.e., G(x) = f(F(x) + x), where f is a ReLU function and F itself has an autoencoder structure, d → 1024 → 512 → l → 512 → 1024 → d (all layers are activated by Mish except the middle one). By default, l is set to 256. The discriminator D is again a three-layer network, d → 512 → 512 → 1. To facilitate and stabilize GAN training, the adversarial losses are optimized via WGAN-GP [43]. We train all networks with the Adam optimizer [44], using a learning rate of 0.0005 for the first stage and 0.0002 for the second.
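The five architectures above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the values of d, l, and n below are placeholders chosen for the example (the default l in the text is 256), and training code (WGAN-GP losses, Adam) is omitted.

```python
import torch
import torch.nn as nn

def mlp(dims):
    """MLP whose hidden layers use Mish and whose last layer is linear."""
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:  # Mish on all but the final layer
            layers.append(nn.Mish())
    return nn.Sequential(*layers)

# Placeholder dimensions for illustration (d = genes, l = content dim, n = batches)
d, l, n = 50, 16, 3

E  = mlp([d, 1024, 512, l])      # stage-I encoder
G1 = mlp([n, 512, 1024, d])      # stage-I decoder from batch one-hot
G2 = mlp([n + l, 512, 1024, d])  # stage-I decoder from batch + content
D  = mlp([d, 512, 512, 1])       # stage-II discriminator (WGAN-GP critic)

class Generator(nn.Module):
    """Stage-II generator with a ResNet-style shortcut: G(x) = relu(F(x) + x)."""
    def __init__(self, d, l):
        super().__init__()
        # autoencoder-shaped F: every layer Mish-activated except the middle one
        self.F = nn.Sequential(
            nn.Linear(d, 1024), nn.Mish(),
            nn.Linear(1024, 512), nn.Mish(),
            nn.Linear(512, l),               # middle (linear) bottleneck
            nn.Linear(l, 512), nn.Mish(),
            nn.Linear(512, 1024), nn.Mish(),
            nn.Linear(1024, d), nn.Mish(),
        )

    def forward(self, x):
        return torch.relu(self.F(x) + x)

G = Generator(d, l)
```

The shortcut means G only needs to learn a residual correction to the expression vector, which keeps the mapping close to identity when batch effects are small.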

In stage II, we need to query the kNNs within each batch and the MNN pairs between batches for all cells. This procedure may be compute-intensive, so we randomly sample at most s = 3000 cells from each batch to calculate all necessary pairs. Then, “annoy”, a locality-sensitive-hashing-based Python package, is used to quickly find the approximate nearest neighbors of each cell [45]. This makes the time cost of the query process approximately constant with respect to the number of cells in each batch; the overall time cost depends only on the number of batches and the network optimization parameters (such as the number of training epochs). The hyperparameters used in this stage are k1 = s/100, k = k1/2, and m = 50. All hyperparameters can be tuned by the user, although the default settings provided good enough results in most of our tested cases.
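The MNN pairing logic can be illustrated with a small NumPy sketch. Note that this uses exact brute-force neighbor search for clarity, whereas the method described above uses “annoy” for fast approximate search on the subsampled cells; the function names and the choice of Euclidean distance are assumptions for this example.

```python
import numpy as np

def knn_indices(query, ref, k):
    """Exact k nearest neighbours in `ref` for each row of `query` (Euclidean)."""
    dists = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=2)
    return np.argsort(dists, axis=1)[:, :k]

def mnn_pairs(Xa, Xb, k):
    """Mutual nearest-neighbour pairs (i, j) between two batches Xa and Xb.

    A pair is kept only if cell j is among the kNNs of cell i in Xb AND
    cell i is among the kNNs of cell j in Xa.
    """
    a_to_b = knn_indices(Xa, Xb, k)
    b_to_a = knn_indices(Xb, Xa, k)
    b_sets = [set(row) for row in b_to_a]
    return [(i, j) for i, row in enumerate(a_to_b)
            for j in row if i in b_sets[j]]
```

Because each batch is subsampled to at most s cells before this step, the pairing cost is bounded regardless of the original batch sizes, which is what makes the overall query time roughly constant per batch.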

To deal with multiple datasets, we use an incremental matching strategy. The sub-dataset with the largest total variance is selected as the anchor, and the remaining sub-datasets are processed in decreasing order of their total variances. Each sub-dataset, once integrated with the anchor, is appended to it. Intuitively, the preferred integration order should place the sub-datasets with more cell types first, so if this information is available, we encourage users to provide their own anchor and integration order. However, we argue that iMAP can still perform reasonably well even if the anchor sub-dataset lacks specific cell types. We demonstrate this on the “panc_rm” dataset, where the “inDrop” batch was selected as the anchor.
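The incremental matching loop can be sketched as follows. The `integrate` function here is a hypothetical stub standing in for iMAP's stage-II batch correction; only the ordering-by-total-variance and the grow-the-anchor logic are taken from the text.

```python
import numpy as np

def integration_order(batches):
    """Indices of batches in decreasing total variance; the first is the anchor."""
    total_var = [np.var(X, axis=0).sum() for X in batches]
    return sorted(range(len(batches)), key=lambda i: total_var[i], reverse=True)

def integrate(anchor, batch):
    """Placeholder for iMAP's stage-II correction of `batch` toward `anchor`."""
    return batch  # hypothetical stub; the real method applies the trained generator

def incremental_integration(batches):
    """Integrate batches into the anchor one by one, growing the anchor each time."""
    order = integration_order(batches)
    anchor = batches[order[0]]
    for i in order[1:]:
        corrected = integrate(anchor, batches[i])
        anchor = np.vstack([anchor, corrected])  # append the corrected batch
    return anchor, order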

All jobs were run on a Linux server with two Intel(R) Xeon(R) E5-2697 v4 CPUs @ 2.30 GHz, 256 GB of DDR4 memory, and an Nvidia GTX 1080Ti GPU.
