Data integration and meta-cluster identification

Liangtao Zheng; Shishang Qin; Xueda Hu; Zemin Zhang

Last updated: Jan 12, 2022

An abbreviated version of this protocol was published in Science in December 2021:

Pan-cancer single-cell landscape of tumor-infiltrating T cells


To integrate heterogeneous data from different platforms and different studies, a three-step procedure was applied.

First, the percentile ranks from the above "clustering per dataset" procedure were used to identify informative genes in the combined datasets. Specifically, the median percentile rank of each gene across datasets was calculated, and genes were sorted by this median in ascending order. After excluding genes in the blacklist, the top 1,500 genes were taken as informative genes. Within each dataset, the normalized expression of each gene was corrected for cell cycle effect, donor effect, percentage of mitochondrial UMI counts, and DIG signature, and then scaled to z-scores.

Second, to reduce technical noise such as transcript dropout, single cells were partitioned into small groups of similar cells (called mini-clusters hereafter). This strategy is similar to the MetaCell method, but our pipeline is compatible with gene expression data measured in CPM/TPM, whereas MetaCell requires count data as input. Specifically, within each dataset, the Seurat v3 pipeline was applied to the z-score matrix of the informative genes. The parameter k for the k-nearest neighbor algorithm was changed from the default value of 20 to 10, and the resolution for Louvain clustering was set to a high value of 50 (25 for datasets with < 500 cells). The resulting small clusters were taken as mini-clusters. The z-score-transformed gene expression was then averaged per mini-cluster, converting the original gene-by-cell expression matrix into a gene-by-mini-cluster expression matrix. These matrices from all datasets were combined by column, and only genes present in all datasets were kept. The combined matrix was used for downstream analysis.

Third, PCA was performed on the combined matrix of the informative genes, and Harmony was applied immediately afterwards. Uniform Manifold Approximation and Projection (UMAP) and clustering (both implemented in the Seurat v3 pipeline) were then performed on the "harmony space" to identify clusters of mini-clusters (called meta-clusters hereafter).
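
The gene selection in the first step can be illustrated with a minimal Python sketch. This is not the authors' implementation (their pipeline is the scPip code); the function name, the input layout, and the toy values below are hypothetical. The sketch assumes that a lower percentile rank marks a more informative gene, and that a gene missing from some datasets takes its median over the datasets that report it; the protocol does not specify the latter.

```python
from statistics import median

def select_informative_genes(rank_by_dataset, blacklist=frozenset(), n_top=1500):
    """Pick informative genes from per-dataset percentile ranks.

    rank_by_dataset: {dataset: {gene: percentile_rank}}
    Assumption: a lower percentile rank means a more informative gene.
    """
    # Collect each gene's percentile rank from every dataset that reports it.
    ranks = {}
    for gene_ranks in rank_by_dataset.values():
        for gene, rank in gene_ranks.items():
            ranks.setdefault(gene, []).append(rank)
    # Median across datasets, then ascending order (ties broken by gene name).
    med = {gene: median(values) for gene, values in ranks.items()}
    ordered = sorted(med, key=lambda gene: (med[gene], gene))
    # Drop blacklisted genes before taking the top n_top.
    kept = [gene for gene in ordered if gene not in blacklist]
    return kept[:n_top]

# Toy percentile ranks for three hypothetical datasets.
toy_ranks = {
    "study_A": {"CD8A": 0.01, "MKI67": 0.02, "GZMK": 0.10, "ACTB": 0.90},
    "study_B": {"CD8A": 0.03, "MKI67": 0.01, "GZMK": 0.05, "ACTB": 0.80},
    "study_C": {"CD8A": 0.02, "MKI67": 0.04, "GZMK": 0.20, "ACTB": 0.95},
}
top_genes = select_informative_genes(toy_ranks, blacklist={"MKI67"}, n_top=2)
print(top_genes)
```

In the protocol itself, n_top is 1500 and the blacklist is the one defined by the authors; the toy blacklist here only shows where the exclusion happens relative to the ranking.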

The code implementing the pipeline is available on GitHub (https://github.com/Japrin/scPip).
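
Independently of the scPip code, the averaging and merging part of the second step can be sketched in a few lines of Python. The sketch assumes that mini-cluster labels have already been assigned to cells (in the protocol this is done with Seurat v3 clustering); the function names, the nested-dictionary layout, and the "dataset.mini_cluster" column naming are hypothetical choices made for illustration.

```python
def average_per_mini_cluster(zscores, labels):
    """Average z-scores per mini-cluster.

    zscores: {gene: [z-score per cell]}
    labels:  mini-cluster label of each cell, in the same cell order.
    Returns {gene: {mini_cluster: mean z-score}}.
    """
    clusters = sorted(set(labels))
    averaged = {}
    for gene, values in zscores.items():
        sums = {c: 0.0 for c in clusters}
        counts = {c: 0 for c in clusters}
        for z, c in zip(values, labels):
            sums[c] += z
            counts[c] += 1
        averaged[gene] = {c: sums[c] / counts[c] for c in clusters}
    return averaged

def combine_by_column(per_dataset):
    """Join gene-by-mini-cluster matrices column-wise, keeping shared genes only.

    per_dataset: {dataset: {gene: {mini_cluster: mean z-score}}}
    """
    shared = set.intersection(*(set(mat) for mat in per_dataset.values()))
    combined = {}
    for gene in sorted(shared):
        row = {}
        for dataset, mat in per_dataset.items():
            for mini_cluster, value in mat[gene].items():
                # Prefix with the dataset name so column names stay unique.
                row[f"{dataset}.{mini_cluster}"] = value
        combined[gene] = row
    return combined

# Toy example: dataset A has three cells in two mini-clusters, B has two cells in one.
mat_a = average_per_mini_cluster(
    {"CD8A": [1.0, 3.0, -2.0], "GZMK": [0.0, 2.0, 4.0]}, ["m1", "m1", "m2"]
)
mat_b = average_per_mini_cluster(
    {"CD8A": [0.5, 1.5], "IL7R": [1.0, 1.0]}, ["m1", "m1"]
)
combined = combine_by_column({"A": mat_a, "B": mat_b})
print(combined)
```

Only CD8A survives in the toy result because it is the only gene measured in both datasets, which mirrors the rule that only genes present in all datasets are kept.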

How to cite:

Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:

Zheng, L., Qin, S., Hu, X. and Zhang, Z. (2022). Data integration and meta-cluster identification. Bio-protocol Preprint. bio-protocol.org/prep1500.
Zheng, L., Qin, S., Si, W., Wang, A., Xing, B., Gao, R., Ren, X., Wang, L., Wu, X., Zhang, J., Wu, N., Zhang, N., Zheng, H., Ouyang, H., Chen, K., Bu, Z., Hu, X., Ji, J. and Zhang, Z. (2021). Pan-cancer single-cell landscape of tumor-infiltrating T cells. Science 374(6574). DOI: 10.1126/science.abe6474

This protocol preprint was submitted via the "Request a Protocol" track.
