To integrate heterogeneous data from different platforms and different studies, a three-step procedure was applied.

First, the percentile rank from the above "clustering per dataset" procedure was used to identify informative genes in the combined datasets. Specifically, the median of the percentile ranks across datasets was calculated for each gene, and genes were ordered by ascending median. After genes in the blacklist were excluded, the top 1,500 genes were designated as informative genes. Within each dataset, the normalized expression of each gene was corrected for cell cycle effect, donor effect, percentage of mitochondrial UMI counts, and DIG signature, and then scaled to z-scores.

Second, to reduce technical noise such as transcript dropout, we partitioned single cells into small groups of similar cells (called mini-clusters hereafter). This strategy is similar to the MetaCell method, but our pipeline is compatible with gene expression data measured in CPM/TPM, whereas MetaCell requires count data as input. Specifically, within each dataset, the Seurat v3 pipeline was applied to the z-score matrix of the informative genes. The parameter k for the k-nearest neighbor algorithm was changed from the default value of 20 to 10, and the resolution for Louvain clustering was set to a high value of 50 (25 for datasets with < 500 cells). The resulting small clusters were taken as mini-clusters. The z-score transformed gene expression was then averaged per mini-cluster, which converted the original gene-by-cell expression matrix into a gene-by-mini-cluster expression matrix. The matrices of all datasets were combined by column, and only genes present in all datasets were kept. The combined matrix was used for downstream analysis.

Third, PCA was performed on the combined matrix of the informative genes, and Harmony was applied immediately afterwards.
Then Uniform Manifold Approximation and Projection (UMAP) and clustering (both implemented in the Seurat v3 pipeline) were performed on the "harmony space" to identify clusters of mini-clusters (called meta-clusters hereafter).
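The gene selection and mini-cluster averaging steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, which relied on the Seurat v3 pipeline. The function names, the dictionary-based data layout, and the "dataset.cluster" column naming are assumptions made for illustration only. The sketch also assumes that the median is taken over the datasets in which a gene was ranked, which the protocol does not specify, and that mini-cluster labels have already been obtained from clustering.

```python
from statistics import mean, median


def select_informative_genes(ranks_per_dataset, blacklist, n_top=1500):
    """Order genes by ascending median percentile rank and keep the top n_top.

    ranks_per_dataset: list of dicts mapping gene -> percentile rank.
    Assumption: the median uses only datasets in which the gene was ranked.
    """
    all_genes = set().union(*ranks_per_dataset)
    med = {g: median(r[g] for r in ranks_per_dataset if g in r) for g in all_genes}
    ordered = sorted(med, key=lambda g: (med[g], g))  # ties broken by gene name
    return [g for g in ordered if g not in blacklist][:n_top]


def average_per_mini_cluster(zscores, labels):
    """Convert a gene-by-cell matrix into a gene-by-mini-cluster matrix.

    zscores: dict mapping gene -> list of per-cell z-scores.
    labels: mini-cluster label of each cell, in the same cell order.
    """
    clusters = sorted(set(labels))
    return {
        gene: {c: mean(v for v, l in zip(values, labels) if l == c) for c in clusters}
        for gene, values in zscores.items()
    }


def combine_datasets(matrices):
    """Combine per-dataset matrices by column, keeping genes present in all.

    matrices: dict mapping dataset name -> gene -> mini-cluster -> mean z-score.
    """
    shared = set.intersection(*(set(m) for m in matrices.values()))
    combined = {}
    for gene in sorted(shared):
        row = {}
        for name, m in matrices.items():
            for cluster, value in m[gene].items():
                row[f"{name}.{cluster}"] = value  # hypothetical column naming
        combined[gene] = row
    return combined


# Toy example with two datasets and three genes (values are made up).
ranks = [
    {"CD8A": 0.01, "GZMK": 0.05, "MALAT1": 0.02},
    {"CD8A": 0.03, "GZMK": 0.02, "MALAT1": 0.01},
]
genes = select_informative_genes(ranks, blacklist={"MALAT1"}, n_top=2)

ds_a = average_per_mini_cluster(
    {"CD8A": [1.0, 3.0, -2.0], "GZMK": [0.0, 2.0, 4.0]}, ["m1", "m1", "m2"]
)
ds_b = average_per_mini_cluster({"CD8A": [0.5]}, ["m1"])
combined = combine_datasets({"A": ds_a, "B": ds_b})
```

In this toy example, GZMK is dropped from the combined matrix because it is absent from dataset B, which mirrors the rule that only genes present in all datasets are kept.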
Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:
Zheng, L., Qin, S., Hu, X. and Zhang, Z. (2022). Data integration and meta-cluster identification. Bio-protocol Preprint. bio-protocol.org/prep1500.
Zheng, L., Qin, S., Si, W., Wang, A., Xing, B., Gao, R., Ren, X., Wang, L., Wu, X., Zhang, J., Wu, N., Zhang, N., Zheng, H., Ouyang, H., Chen, K., Bu, Z., Hu, X., Ji, J. and Zhang, Z. (2021). Pan-cancer single-cell landscape of tumor-infiltrating T cells. Science 374(6574). DOI: 10.1126/science.abe6474
This protocol preprint was submitted via the "Request a Protocol" track.