Processing of scRNA-Seq Data

Miao Su, Kuang-Yuan Qiao, Xiao-Li Xie, Xin-Ying Zhu, Fu-Lai Gao, Chang-Juan Li, Dong-Qiang Zhao

Scripts written in R were used to analyze the scRNA-seq data. The count files were read into R and formatted, expression values of duplicated genes were averaged, and the transcriptome data of ICC cells and adjacent tissue cells were merged into a single matrix. We used the R package “Seurat” to process the data, including quality control, gene and cell filtering, normalization, identification of variable genes, data scaling, principal component analysis (PCA), and t-distributed stochastic neighbor embedding (t-SNE). All parameters were left at their defaults unless otherwise specified.

First, a Seurat object was created from the single-cell data with the CreateSeuratObject function (arguments: min.cells = 3, min.features = 200), which also excluded poor-quality cells: only genes detected in at least three cells and cells with at least 200 detected genes were retained for the following analysis. For quality control, the per-cell gene number and number of gene types were examined together with the percentage of mitochondrial genes, which was calculated with the PercentageFeatureSet function (argument: pattern = MT-). The correlation between the number of sequenced genes and the number of sequenced gene types was calculated and visualized with the FeatureScatter function.

Second, to exclude non-cells and cell aggregates, the subset function was used to retain only cells with more than 500 expressed gene types, gene expression levels of more than 1,000 and fewer than 20,000, and a mitochondrial proportion of <20%. The data were log-normalized with the NormalizeData function, and the top 1,500 variable genes were identified with the FindVariableFeatures function (arguments: selection.method = vst, nfeatures = 1500) for subsequent analysis.

Third, we used the ScaleData function (argument: vars.to.regress = percent.mt) to scale the data and mitigate the variation associated with mitochondrial gene content. PCA was then performed with the RunPCA function for dimension reduction.
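The preprocessing steps described above could be sketched in R roughly as follows. This is an illustrative outline under stated assumptions, not the authors' script: the function names `average_duplicated_genes` and `preprocess_counts` are hypothetical, the mitochondrial pattern is written as the anchored `^MT-` used for human gene symbols, and "gene types" and "expression levels" are assumed to correspond to Seurat's `nFeature_RNA` and `nCount_RNA` metadata columns.

```r
# Average rows that share a gene symbol (base R only).
# counts: numeric matrix, genes x cells; genes: one gene symbol per row.
average_duplicated_genes <- function(counts, genes) {
  summed <- rowsum(counts, group = genes)              # per-symbol sums, rows sorted by symbol
  n_rows <- as.vector(table(genes)[rownames(summed)])  # number of rows behind each symbol
  summed / n_rows                                      # row i is divided by n_rows[i]
}

# QC, filtering, normalization, scaling and PCA with the parameters quoted in the text.
# Requires the Seurat package; calls are namespace-qualified, so defining
# this function does not itself load Seurat.
preprocess_counts <- function(counts) {
  obj <- Seurat::CreateSeuratObject(counts = counts, min.cells = 3, min.features = 200)

  # Assumption: anchored pattern "^MT-"; the text gives the pattern as MT-.
  obj[["percent.mt"]] <- Seurat::PercentageFeatureSet(obj, pattern = "^MT-")

  # Scatter plot of total counts against detected genes per cell.
  print(Seurat::FeatureScatter(obj, feature1 = "nCount_RNA", feature2 = "nFeature_RNA"))

  # Assumption: "gene types" maps to nFeature_RNA, "expression levels" to nCount_RNA.
  obj <- subset(obj, subset = nFeature_RNA > 500 & nCount_RNA > 1000 &
                  nCount_RNA < 20000 & percent.mt < 20)

  obj <- Seurat::NormalizeData(obj)  # default method is log-normalization
  obj <- Seurat::FindVariableFeatures(obj, selection.method = "vst", nfeatures = 1500)
  obj <- Seurat::ScaleData(obj, vars.to.regress = "percent.mt")
  Seurat::RunPCA(obj)
}
```

A merged genes-by-cells count matrix (for example the output of `average_duplicated_genes`) would be passed to `preprocess_counts`, which returns a Seurat object with PCA results attached.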
After calculation with the JackStraw function, the JackStrawPlot (dims = 1:20) and ElbowPlot (ndims = 40) functions were used to determine the number of significant principal components (PCs) to use for clustering. Based on these plots, the top 20 PCs were selected for downstream analysis. Lastly, the cells were clustered with the FindClusters function at a resolution of 0.5, and the RunTSNE function was used to embed and visualize the clusters with the t-SNE algorithm. The FindAllMarkers function (arguments: min.pct = 0.25, logfc.threshold = 0.25) was used to find markers by comparing each cluster with all others; differentially expressed genes between two identities were identified with the FindMarkers function. Feature plots and heatmaps of gene expression were generated with the Seurat functions FeaturePlot and DoHeatmap, respectively. Cell type–specific marker genes were taken from the published literature (Zhang et al., 2020) and compared with our results to assign a cell type to each cluster. Clusters consisting of immune cells were extracted and reprocessed as described above, and each immune cell type was further divided into subclusters. Marker genes of each immune cell type were identified by comparing ICC subclusters with normal subclusters, with an adjusted P-value (adjPval) < 0.05 as the cutoff criterion. These marker genes were taken as the differentially expressed genes (DEGs) of each immune cell type.
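The clustering and marker-detection steps above might look roughly like this in R. It is a sketch under assumptions rather than the authors' code: the function names are hypothetical, `ScoreJackStraw` and `FindNeighbors` are not mentioned in the text but are added because Seurat needs them before `JackStrawPlot` and `FindClusters`, and the metadata column `tissue` with the labels `"ICC"` and `"adjacent"` is an invented placeholder for however tissue origin was recorded.

```r
# Keep only rows whose adjusted P-value is below the cutoff (base R only).
# markers: data frame as returned by Seurat::FindMarkers (column p_val_adj).
filter_by_adjusted_p <- function(markers, cutoff = 0.05) {
  markers[which(markers$p_val_adj < cutoff), , drop = FALSE]
}

# Assumes `obj` is a Seurat object on which RunPCA has already been run
# with at least 40 PCs (the RunPCA default computes 50).
cluster_and_find_markers <- function(obj, dims = 1:20, resolution = 0.5) {
  # PC significance diagnostics.
  obj <- Seurat::JackStraw(obj, dims = max(dims))
  obj <- Seurat::ScoreJackStraw(obj, dims = dims)  # assumption: needed before JackStrawPlot
  print(Seurat::JackStrawPlot(obj, dims = dims))
  print(Seurat::ElbowPlot(obj, ndims = 40))

  # Graph-based clustering, then t-SNE embedding for visualization.
  obj <- Seurat::FindNeighbors(obj, dims = dims)   # assumption: needed before FindClusters
  obj <- Seurat::FindClusters(obj, resolution = resolution)
  obj <- Seurat::RunTSNE(obj, dims = dims)

  # Markers of each cluster versus all other cells.
  markers <- Seurat::FindAllMarkers(obj, min.pct = 0.25, logfc.threshold = 0.25)
  list(object = obj, markers = markers)
}

# Compare ICC cells with adjacent-tissue cells within an (already subsetted) object.
# group_col, case and control are hypothetical names, not taken from the text.
compare_tissues <- function(obj, group_col = "tissue", case = "ICC",
                            control = "adjacent", cutoff = 0.05) {
  res <- Seurat::FindMarkers(obj, ident.1 = case, ident.2 = control, group.by = group_col)
  filter_by_adjusted_p(res, cutoff)
}
```

Genes of interest from the returned marker table could then be passed to `Seurat::FeaturePlot(obj, features = ...)` or `Seurat::DoHeatmap(obj, features = ...)` for visualization.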
