2.1 Data description

Zhe Sun; Ting Wang; Ke Deng; Xiao-Feng Wang; Robert Lafyatis; Ying Ding; Ming Hu; Wei Chen

This method is extracted from research article: Bioinformatics, Aug 2017

DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data

DOI: 10.1093/bioinformatics/btx490

Droplet-based scRNA-Seq data can be summarized as a UMI count matrix (Table 1), in which each row represents one gene and each column represents one single cell. Each entry in the matrix is the number of transcripts (unique UMIs) detected for one gene in one cell. Compared to data generated by earlier generations of scRNA-Seq technologies, droplet-based scRNA-Seq data have three important features (Gawad et al., 2016; Stegle et al., 2015; Zheng et al., 2017). First, each experiment can generate thousands of cells, which dramatically increases the data dimension and the computational burden. Second, the use of UMIs reduces PCR amplification bias and quantifies the number of captured molecules. Because the droplet-based sequencing protocol amplifies only the 3′ end of each transcript, the number of UMIs is independent of transcript length. The normalization used in RPKM and FPKM, which adjusts for transcript length, is therefore not valid for droplet-based scRNA-Seq data, and the raw count data should be modeled directly to retain their biological interpretation. Third, the UMI count matrix is extremely sparse and thus violates the statistical assumptions of many existing clustering methods. Supplementary Figure S1 shows the empirical distribution of UMI counts for a few representative genes, demonstrating a non-ignorable proportion of zeros at different expression levels. Pre-selection of informative single cells and informative genes is necessary before downstream clustering analysis. After clustering, the results are usually visualized with t-distributed stochastic neighbor embedding (t-SNE) (van der Maaten and Hinton, 2008), which embeds high-dimensional transcriptome data into a two-dimensional scatter plot. Note that t-SNE is a visualization tool and is not intended for clustering scRNA-Seq data.
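The layout, sparsity, and pre-selection step described above can be illustrated with a minimal Python sketch. It is not the DIMM-SC implementation: the gene names, the counts, the function names, and the two filtering thresholds (`min_umis_per_cell`, `min_cells_per_gene`) are all hypothetical values chosen for illustration, and a real analysis would choose its filtering criteria from the data.

```python
from typing import List, Tuple

# Toy UMI count matrix: rows are genes, columns are single cells.
# Gene names and counts are invented for illustration only.
genes: List[str] = ["gene_a", "gene_b", "gene_c", "gene_d"]
counts: List[List[int]] = [
    [0, 3, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [5, 2, 0, 7, 1],
    [0, 0, 0, 1, 0],
]


def zero_fraction(matrix: List[List[int]]) -> float:
    """Fraction of entries equal to zero (a simple sparsity measure)."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def total_umis_per_cell(matrix: List[List[int]]) -> List[int]:
    """Column sums: total number of UMIs observed in each cell."""
    return [sum(column) for column in zip(*matrix)]


def cells_expressing(matrix: List[List[int]]) -> List[int]:
    """For each gene (row), the number of cells with a non-zero count."""
    return [sum(1 for value in row if value > 0) for row in matrix]


def filter_matrix(
    matrix: List[List[int]],
    gene_names: List[str],
    min_umis_per_cell: int,   # hypothetical threshold, not from the source
    min_cells_per_gene: int,  # hypothetical threshold, not from the source
) -> Tuple[List[str], List[int], List[List[int]]]:
    """Keep cells with enough total UMIs, then genes seen in enough kept cells.

    The raw counts are left untouched: no length-based scaling is applied.
    """
    keep_cells = [
        j for j, total in enumerate(total_umis_per_cell(matrix))
        if total >= min_umis_per_cell
    ]
    sub = [[row[j] for j in keep_cells] for row in matrix]
    keep_genes = [
        i for i, n in enumerate(cells_expressing(sub))
        if n >= min_cells_per_gene
    ]
    return (
        [gene_names[i] for i in keep_genes],
        keep_cells,
        [sub[i] for i in keep_genes],
    )


kept_genes, kept_cells, filtered = filter_matrix(
    counts, genes, min_umis_per_cell=2, min_cells_per_gene=2
)
print("zero fraction:", zero_fraction(counts))
print("UMIs per cell:", total_umis_per_cell(counts))
print("kept genes:", kept_genes, "kept cell indices:", kept_cells)
print("filtered counts:", filtered)
```

The sketch filters cells first and genes second, so that a gene is judged only on the cells that remain; the reverse order is equally plausible and is a design choice rather than something the text prescribes.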

Table 1. An example of the raw UMI count table from droplet-based scRNA-Seq data

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)
