Three commonly used scRNA-seq datasets were employed to evaluate the performance of different batch effect removal methods. The first dataset, “panc_rm,” includes human pancreas cells measured by five different platforms. To measure the ability of different methods to detect batch-specific cell types, we manually removed “ductal” cells from the “CEL-seq” sub-dataset and “acinar” and “alpha” cells from the “inDrop” sub-dataset. The “ductal” cell type has the largest number of cells in the “CEL-seq” sub-dataset. After its removal, the primary variance of “CEL-seq” may be determined by the second and third most numerous cell types, i.e., “acinar” and “alpha” cells. We therefore removed these two cell types from another sub-dataset, “inDrop,” which was selected as the integration anchor. The second dataset, “cell_lines,” is composed of three sub-datasets, all sequenced on the 10x platform. Two of them are pure cell lines (“Jurkat” and “293T”), and “Mix” is an equal mixture of “Jurkat” and “293T.” For the “Mix” dataset, we ran the standard “Seurat” pipeline to cluster and annotate the cells. Clusters with high expression of XIST were labeled “293T” and the others “Jurkat.” The third dataset, “DC_rm,” consists of human dendritic cells (DCs) sequenced by the Smart-seq2 protocol. CD1C DCs in batch 1 and CD141 DCs in batch 2, which are biologically similar, were also removed.
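The removal of batch-specific cell types described for “panc_rm” can be sketched as a filter on the per-cell metadata. This is a minimal illustration in Python with pandas, not the code used in the study: the column names `batch` and `celltype`, the exact batch labels, and the toy metadata table are all assumptions made for the example.

```python
import pandas as pd

# Cell types to drop per batch, following the description of "panc_rm".
# The batch labels are assumed; the real metadata may name them differently.
REMOVALS = {
    "CEL-seq": ["ductal"],
    "inDrop": ["acinar", "alpha"],
}


def remove_batch_specific_types(meta, removals,
                                batch_col="batch", celltype_col="celltype"):
    """Return a copy of `meta` without the (batch, cell type) pairs in `removals`.

    `batch_col` and `celltype_col` are hypothetical column names.
    """
    drop = pd.Series(False, index=meta.index)
    for batch, cell_types in removals.items():
        # Mark cells that belong to this batch AND to one of the listed types.
        drop |= (meta[batch_col] == batch) & meta[celltype_col].isin(cell_types)
    return meta.loc[~drop].copy()


# Toy metadata standing in for the real annotation table.
meta = pd.DataFrame(
    {
        "batch": ["CEL-seq", "CEL-seq", "CEL-seq",
                  "inDrop", "inDrop", "inDrop", "inDrop"],
        "celltype": ["ductal", "acinar", "alpha",
                     "ductal", "acinar", "alpha", "beta"],
    },
    index=[f"cell{i}" for i in range(7)],
)

filtered = remove_batch_specific_types(meta, REMOVALS)
print(filtered)
```

Only the cells matching both the batch and the cell type are dropped, so “ductal” cells stay in “inDrop” and “acinar” and “alpha” cells stay in “CEL-seq”; each removed type is then absent from one batch only, which is what makes it batch-specific for the evaluation.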

Two recently published benchmark datasets, “SCP424_PBMC” and “SCP425_cortex,” were also included for comparison of different methods. They profile thousands of cells from peripheral blood mononuclear cells and brain tissue, respectively, with over ten protocols covering most single-cell and/or single-nucleus profiling methods. The log-10 K data and meta information were downloaded from the Single Cell Portal (Additional file 1: Table S1). We also tested the performance of iMAP on five additional datasets with various numbers of cells; detailed information can be found in Additional file 1: Table S1.

To test the performance of iMAP on large-scale datasets, especially its time cost, we ran iMAP on the Tabula Muris dataset, which consists of mouse cells sequenced by two platforms, i.e., Smart-seq2 and 10x. The downloaded Seurat object was updated to version 3 with the “UpdateSeuratObject” function. The sequencing platforms were regarded as the batches. Another dataset containing over 600,000 cells from the Human Cell Atlas was also adopted to test the scalability of iMAP; its detailed information can be found in Additional file 1: Table S1.

The “CRC” dataset was used to test the application of iMAP to tumor microenvironments. Nearly 50,000 cells from human colon cancer were sequenced on either the Smart-seq2 or the 10x platform. Cells from different patients sequenced by Smart-seq2 show less technical variation than those sequenced by 10x [30]. Therefore, we treated all Smart-seq2 cells as a single batch and each patient sequenced by 10x as a separate batch. Cell type and tissue source information was obtained from the original publication.
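The batch definition used for the “CRC” dataset can be illustrated with a short pandas sketch. It is only an example under stated assumptions: the column names `platform` and `patient`, the patient identifiers, and the `10x_` label prefix are hypothetical and do not come from the original publication.

```python
import pandas as pd


def assign_batches(meta, platform_col="platform", patient_col="patient"):
    """Build batch labels: one batch for all Smart-seq2 cells,
    one batch per patient for 10x cells.

    Column names and the "10x_" prefix are assumptions for illustration.
    """
    is_ss2 = meta[platform_col] == "Smart-seq2"
    per_patient = "10x_" + meta[patient_col].astype(str)
    # Keep the per-patient label for 10x cells, overwrite Smart-seq2 cells.
    return per_patient.where(~is_ss2, "Smart-seq2").rename("batch")


# Toy metadata with hypothetical patient identifiers.
meta = pd.DataFrame(
    {
        "platform": ["Smart-seq2", "Smart-seq2", "10x", "10x", "10x"],
        "patient": ["P1", "P2", "P1", "P3", "P3"],
    }
)

batches = assign_batches(meta)
print(batches.tolist())
```

In this toy table the two Smart-seq2 cells share one label even though they come from different patients, while the 10x cells are split by patient, giving three batches in total.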
