scRNA-Seq data preprocessing

Liangtao Zheng; Shishang Qin; Xueda Hu; Zemin Zhang

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Preprint

scRNA-Seq data preprocessing

LZ Liangtao Zheng

SQ Shishang Qin

XH Xueda Hu

ZZ Zemin Zhang email

Last updated date: Jan 12, 2022 Views: 1058 Forks: 0

An abbreviated version of this protocol was published in Science in Dec, 2021

Pan-cancer single-cell landscape of tumor-infiltrating T cells

Download PDF

Ask a question

How to cite

Favorite

To reduce noise in the data, we performed two-run clustering on each dataset. Examining the result from the first run clustering, we identified contamination clusters and clusters that arose from unwanted factors. The second run clustering without such noise prepared the basis for data integration across multiple datasets. Specifically, for each dataset from droplet-based technology, if a UMI-count table was available, we re-normalized the data using the R package SCTransform (version 0.3.2.9004), which is tailored for droplet-based scRNA-Seq data and much more robust to noise compared with the default normalization method of Seurat. Genes were ranked descendingly by variance (for TPM data) or residual variance estimated from the "vst" method implemented in the FindVariableFeatures function of Seurat (for CPM, SCTransform, or scran normalized data). Excluding genes in a blacklist (described later), the top 1500 genes were identified as highly variable genes (HVG). Then we ran the Seurat pipeline for the first time: regressed out cell cycle effect and donor effect; performed principal component analysis (PCA) using the highly variable genes; built a Shared Nearest Neighbor (SNN) Graph using the top 15 principal components and clustered cells using the Louvain algorithm. For all data sets, the resolution parameter of clustering was set to 2.0, which could produce sufficient fine clustering according to our experiments. All other parameters were kept as the default values. After clustering, limma (version 3.42.2) was used for detecting differentially expressed genes. For each cluster, cells were compared with cells from all other clusters. Genes with log2 fold change larger than a specified threshold (1.00 and 0.25 for SmartSeq2-based datasets and droplet-based datasets respectively) and false discovery rate (FDR) < 0.01 were defined as the marker genes of the cluster. Also, analysis of variance (ANOVA) was performed to obtain F value using R package aov.

From the first run clustering, in multiple datasets, we identified clusters with marker genes associated with tissue dissociation operation including heat shock protein-encoding genes (50). Recurrent marker genes of these tissue dissociation-related clusters, i.e. those that showed significant high expression in more than 40% tissue dissociation-related clusters, were defined as disassociation induced genes (DIG) and were included in the gene blacklist after the first run clustering. Further, we found that for a few datasets, there were clusters showing signature of macrophage (high expression of "LYZ","C1QA","C1QB"), alv2 cell (high expression of "SFTPC","SFTPA1","SFTPA2"), melanocyte (high expression of "GPNMB", "TYR", "PMEL", "MLPH") or B cell. The potential contamination or doublet cells were removed by dropping the clusters (for B cell) or by filtering out cells with high average expression of the signature genes of the contaminating cell types (macrophage, alv2 cell, and melanocyte).

Then we ran the Seurat pipeline for the second time, with the gene blacklist including DIG. The unwanted effects caused by the cell cycle, donor, percentage of mitochondrial UMI counts, and DIG signature were removed by regression in this run. After the pipeline was completed successfully, the genes were ranked by F value from ANOVA in descent order, and then the rank values were converted to percentile rank values (by dividing by the total number of genes). The filtered gene expression matrix and percentile rank would be used in the downstream data integration.

The gene blacklist contains immunoglobulin genes, T cell receptor (TCR) genes from R package biomaRt (version 2.42.1); ribosome-protein-coding genes (gene symbol with string pattern "^RP([0-9]+-|[LS])"), gene MALAT1 , proliferation genes, and dissociation induced genes from the first run of the Seurat pipeline. Processes including cell cycle and tissue dissociation may influence the expression of not only the associated signature genes, but also the whole transcriptome. To minimize the impact of those processes, the S phase score and G2M phase score were calculated with function CellCycleScoring , and the DIG score was calculated with function AddModuleScore . Then those scores were regressed out in the Seurat pipeline.

The code implemented the pipeline could be found in github (https://github.com/Japrin/scPip).

How to cite：

Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:

Zheng, L, Qin, S, Hu, X and Zhang, Z(2022). scRNA-Seq data preprocessing. Bio-protocol Preprint. bio-protocol.org/prep1503.
Zheng, L., Qin, S., Si, W., Wang, A., Xing, B., Gao, R., Ren, X., Wang, L., Wu, X., Zhang, J., Wu, N., Zhang, N., Zheng, H., Ouyang, H., Chen, K., Bu, Z., Hu, X., Ji, J. and Zhang, Z.(2021). Pan-cancer single-cell landscape of tumor-infiltrating T cells. Science 374(6574). DOI: 10.1126/science.abe6474