2.1 Data description

Zhe Sun, Ting Wang, Ke Deng, Xiao-Feng Wang, Robert Lafyatis, Ying Ding, Ming Hu, Wei Chen

Droplet-based scRNA-Seq data can be summarized as a UMI count matrix (Table 1), in which each row represents one gene and each column represents one single cell. Each entry in the UMI count matrix is the number of transcripts (unique UMIs) detected for one gene in one single cell. Compared with data generated by earlier generations of scRNA-Seq technologies, droplet-based scRNA-Seq data have three important features (Gawad et al., 2016; Stegle et al., 2015; Zheng et al., 2017). First, each experiment can generate thousands of cells, which dramatically increases the data dimension and the computational burden. Second, the use of UMIs reduces PCR amplification bias and quantifies the copies of captured molecules. Because the droplet-based sequencing protocol amplifies the 3′ end of the transcript, the number of UMIs is independent of the total transcript length. The normalization used in RPKM and FPKM, which adjusts for total transcript length, is therefore invalid for droplet-based scRNA-Seq data, and the raw counts should be modeled directly to retain their biological interpretation. Third, the UMI count matrix is extremely sparse and thus violates the statistical assumptions of many existing clustering methods. Supplementary Figure S1 shows the empirical distribution of the UMI counts for a few representative genes, demonstrating a non-ignorable proportion of zeros at different levels of expression. Pre-selection of informative single cells and informative genes is necessary before the downstream clustering analysis. After clustering, the results are usually visualized by t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008), which embeds the high-dimensional transcriptome data into a two-dimensional scatter plot. Note that t-SNE is a visualization tool and is not intended for clustering scRNA-Seq data.
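The construction of the count matrix, its sparsity, and the pre-selection step can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the toy read records, and the two thresholds (`min_umis_per_cell`, `min_cells_per_gene`) are hypothetical and chosen only for illustration, since the text does not specify the selection criteria. The sketch assumes each read has already been assigned a cell barcode, a gene, and a UMI.

```python
from collections import defaultdict


def build_umi_count_matrix(reads):
    """Collapse (cell, gene, umi) read records into a sparse UMI count matrix.

    Returns {gene: {cell: number of distinct UMIs}}. Reads that share the same
    (cell, gene, umi) triple are treated as PCR duplicates and counted once.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for cell, gene, _umi in set(reads):  # set() removes duplicate triples
        counts[gene][cell] += 1
    return {gene: dict(per_cell) for gene, per_cell in counts.items()}


def zero_fraction(counts, genes, cells):
    """Fraction of zero entries in the genes x cells matrix."""
    cell_set = set(cells)
    nonzero = sum(
        1 for gene in genes for cell in counts.get(gene, {}) if cell in cell_set
    )
    return 1.0 - nonzero / (len(genes) * len(cells))


def filter_matrix(counts, min_umis_per_cell, min_cells_per_gene):
    """Keep cells with enough total UMIs, then genes seen in enough kept cells.

    Both thresholds are illustrative assumptions, not values from the text.
    """
    umis_per_cell = defaultdict(int)
    for per_cell in counts.values():
        for cell, n in per_cell.items():
            umis_per_cell[cell] += n
    kept_cells = {c for c, n in umis_per_cell.items() if n >= min_umis_per_cell}

    filtered = {}
    for gene, per_cell in counts.items():
        row = {c: n for c, n in per_cell.items() if c in kept_cells}
        if len(row) >= min_cells_per_gene:
            filtered[gene] = row
    return filtered, kept_cells


# Toy read records (cell barcode, gene, UMI); entirely made up.
reads = [
    ("cell1", "GeneA", "AAAT"),
    ("cell1", "GeneA", "AAAT"),  # PCR duplicate of the read above
    ("cell1", "GeneA", "CCGT"),
    ("cell1", "GeneB", "GGTA"),
    ("cell2", "GeneA", "TTAC"),
    ("cell3", "GeneC", "ACGT"),
]

counts = build_umi_count_matrix(reads)
genes = sorted(counts)
cells = sorted({cell for cell, _gene, _umi in reads})
sparsity = zero_fraction(counts, genes, cells)
filtered, kept_cells = filter_matrix(
    counts, min_umis_per_cell=2, min_cells_per_gene=1
)
```

Storing only the nonzero entries mirrors how sparse count matrices are usually held in memory. Note that the counts are left as raw integers, with no transcript-length adjustment, in line with the argument above.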

Table 1. An example of the raw UMI count table from droplet-based scRNA-Seq data