Single-cell RNA-seq data analyses

SC Silvia Carvalho
LZ Luna Zea-Redondo
TT Tsz Ching Chloe Tang
PS Philipp Stachel-Braum
DM Duncan Miller
PC Paulo Caldas
AK Alexander Kukalev
SD Sebastian Diecke
SG Stefanie Grosswendt
AG Ana Rita Grosso
AP Ana Pombo
request Request a Protocol
ask Ask a question
Favorite

The four 10x Genomics scRNA-seq libraries were simultaneously mapped to the mouse genome (refdata-gex-mm10-2020-A version provided by cellranger) using the cellranger count function from 10× Genomics Cellranger software (version cellranger-6.1.2) with default parameters, except -expect-cells 10,000 (total number of cells targeted per experiment). The four 10x Genomics scRNA-seq libraries were aggregated on one unique count matrix using cellranger aggr function from 10X Genomics Cellranger software (version cellranger-6.1.2) with default parameters. The analysis of the single-cell data was performed using Seurat version 4.1.0. To remove low-quality cells, we applied sample-specific thresholds for both number of UMIs and number of features, by subtracting one standard deviation from the mean value across all cells of UMIs or features, respectively. Cells that did not pass both of these two thresholds were removed from the analyses. Additionally, cells having higher than 10% of mitochondrial reads were also excluded.

Single-cell datasets were normalised using the LogNormalize method, with a scale factor of 10,000, and the top 2000 variable genes were calculated using the vst method. Lastly, the data were scaled using the ScaleData function from Seurat, using all genes as input.

PCA analysis was performed by running the RunPCA function from Seurat with default parameters, and it was followed by the integration of all samples using Harmony (version 1.0; Korsunsky et al., 2019). The batch-corrected data was subjected to uniform manifold approximation and projection (UMAP) using the first ten dimensions. To explore distinct biologically meaningful groupings, we proceed with clustering by running FindNeighbors (first 20 dimensions) and FindClusters (0.2 resolution) from the Seurat package. To estimate the cell cycle phase of each cell in our study, we used the project_cycle_space and estimate_cycle_position from the tricycle package (version 1.2.1). For simplicity, numerical values were converted to categorical labels following the developer's recommendations: cells with a tricycle position between 0.5pi and pi were classified as ‘S’ phase, those between pi and 1.75pi as ‘G2M’ phase, those greater than or equal to 1.75pi or less than 0.25pi as ‘G1/G0’ phase, and the rest as ‘NA’.

To identify cluster-specific marker genes that may characterise all the different cellular populations, we performed differential gene expression analysis using the FindAllMarkers function in Seurat. This analysis was conducted using the Wilcoxon rank-sum test and considering only positively regulated genes detected in at least 10% of cells within each cluster with a minimum fold change of 0.25. We excluded clusters that were composed of less than 2% of the total number of cells in the dataset, specifically clusters 5 and 6. Analyses of the expression of pluripotency markers showed a progressive reduction in its levels along the identified clusters (Fig. 3; Fig. S3). This suggests that the UMAP projection captures a progressive state that underlies a loss of stemness from cluster 1 to cluster 4 and that a trajectory analysis would not provide additional insight into the cellular dynamics of our system. Pairwise differential expression analysis between cluster 2 (mostly composed of Srrm2+/− cells, making up 95% of the cluster) and cluster 1 (mostly containing WT cells, making up 94% of the cluster) was performed using FindMarkers function (Seurat). Genes with Log2 Fold change>0 represent upregulated genes in cluster 2, whereas genes with Log2 Fold change<0 are upregulated in cluster 1. Differentially expressed genes were expressed in at least 50% of the cells in at least one of the two clusters being compared (min.pct=0.5).

To investigate the pluripotency state of the single cells, we assessed the expression of gene markers, as previously described (Wang et al., 2021a). Over-representation analysis for biological terms (http://geneontology.org/) and phenotype terms (https://www.informatics.jax.org/vocab/mp_ontology/) were performed with WEB-based GEne SeT AnaLysis Toolkit (WebGestalt 2019), using the default parameters and weighted set to reduce redundancy of the gene sets in the enrichment result (Zhang et al., 2005; Wang et al., 2013, 2017; Liao et al., 2019). Gene markers were provided as input against a gene universe of all genes expressed in at least ten of the cells that were included in the analysis. The top five terms with a higher normalised enrichment score (NES) for each cluster were selected to produce the heatmaps shown in Fig. 3F and Fig. S3E.

Mice in vivo wild-type scRNA-seq data and cell type annotations from E3.5-5.5 and E6.5-8.5 development stages were acquired from GSE123046 (Nowotschin et al., 2019) and GSE137337 (Chan et al., 2019; Grosswendt et al., 2020; Haggerty et al., 2021) repositories, respectively. Data normalisation was conducted using the Seurat package's NormalizeData function (v. 4.3.0.1). For the E3.5-5.5 reference dataset, ICM cells were identified and annotated de novo based on increased and specific expression of key pluripotency genes (Pou5f1, Sox2, Nanog, Klf2, Klf5). The most similar in vivo cell types were identified for wild-type and Srrm2+/− cells using the scmapCluster function (Scmap v. 1.22.3 using 2000 highly variable features and an assignment threshold of 0.5) of the Scmap cell-cluster label transfer method (Kiselev et al., 2018).

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A