Data preprocessing and the first-time unsupervised clustering

Zechuan Chen; Zeruo Yang; Xiaojun Yuan; Xiaoming Zhang; Pei Hao



This method is extracted from research article: BMC Bioinformatics, Apr 2021

scSensitiveGeneDefine: sensitive gene detection in single-cell RNA sequencing data by Shannon entropy

DOI: 10.1186/s12859-021-04136-1


After QC, we used the Seurat package (version 3.1.5) in R (version 3.6.3) to run the same analysis pipeline on all scRNA-seq data sets. First, we applied the default global-scaling normalization method, “LogNormalize”, which divides the feature expression measurements of each cell by that cell's total expression, multiplies the result by a scale factor (10,000), and log-transforms it. Second, to avoid interference from doublets, we identified and removed doublet cells with the DoubletFinder [25] package (version 2.0.3) in R. Third, we calculated the CV-rank of each gene across all cells and kept the 2000 genes with the highest CV-rank for the downstream analyses, including principal component analysis (PCA) and unsupervised clustering with the Louvain algorithm [26]. We then performed PCA to identify the true dimensionality of each data set and retained as many principal components as possible for the downstream analyses. For the unsupervised clustering, we used 0.6 as the default resolution parameter, and we refer to this clustering result as the first-time unsupervised clustering result (Fig. 1a–c).
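The normalization and gene-selection steps described above can be sketched outside Seurat as follows. This is a minimal NumPy illustration, not the authors' pipeline: the function names `log_normalize` and `top_cv_genes` are hypothetical, the log transform is assumed to be the natural log of (1 + x), and a plain coefficient of variation (standard deviation divided by mean) stands in for the CV-rank, whose exact definition is not given in this section.

```python
import numpy as np

def log_normalize(counts, scale_factor=10_000):
    """Normalize a genes x cells count matrix per cell, then log-transform.

    Assumption: the transform is log(1 + x), applied after scaling each
    cell's counts by its total and multiplying by scale_factor.
    """
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=0, keepdims=True)  # total expression per cell
    totals[totals == 0] = 1.0                   # guard against empty cells
    return np.log1p(counts / totals * scale_factor)

def top_cv_genes(norm, n_top=2000):
    """Return row indices of the n_top genes with the highest CV.

    Assumption: plain CV (std / mean) is a simplified stand-in for the
    CV-rank used in the text.
    """
    norm = np.asarray(norm, dtype=float)
    mean = norm.mean(axis=1)
    std = norm.std(axis=1)
    cv = np.divide(std, mean, out=np.zeros_like(std), where=mean > 0)
    order = np.argsort(-cv, kind="stable")      # descending CV
    return order[:n_top]

# Toy example: 2 genes x 2 cells
counts = np.array([[1, 0],
                   [1, 4]])
norm = log_normalize(counts)
selected = top_cv_genes(norm, n_top=1)
```

In the toy matrix, the first gene is expressed in only one of the two cells, so it varies more relative to its mean and is the one selected.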

Fig. 1 Workflow for sensitive gene identification. a After single-cell sequencing, we obtained expression profiles of various cell types, with different colors representing different cell types. We used Seurat to calculate the CV-rank of all genes across all cells, and the top 2000 genes were defined as highly variable genes (HVGs, red). b Based on the results of the first-time unsupervised clustering, we detected high CV-rank genes in each cluster. c We calculated Shannon entropy from the average expression of these genes (those with a high CV-rank in more than half of the clusters) among the cells in each cluster. Genes with high entropy (higher than the median entropy) were regarded as sensitive genes. d We re-selected the top 2000 HVGs after removing the sensitive genes from the expression matrix.
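The entropy step in panel c can be illustrated with a short sketch. The exact formulation is not stated in this section, so the code below makes explicit assumptions: each gene's per-cluster average expression is rescaled to sum to 1 and treated as a probability distribution, entropy is computed in bits, and the input is assumed to contain only the candidate genes (those with a high CV-rank in more than half of the clusters). The function names are hypothetical.

```python
import numpy as np

def cluster_means(expr, labels):
    """Average expression per cluster for a genes x cells matrix."""
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    return np.column_stack(
        [expr[:, labels == c].mean(axis=1) for c in clusters]
    )

def shannon_entropy(means):
    """Shannon entropy (bits) of each gene's cluster-average profile.

    Assumption: each row is rescaled to sum to 1 before computing
    H = -sum(p * log2(p)); zero entries contribute nothing.
    """
    means = np.asarray(means, dtype=float)
    totals = means.sum(axis=1, keepdims=True)
    p = np.divide(means, totals, out=np.zeros_like(means), where=totals > 0)
    logp = np.log2(p, out=np.zeros_like(p), where=p > 0)
    return -(p * logp).sum(axis=1)

def sensitive_genes(entropy):
    """Indices of genes whose entropy exceeds the median entropy."""
    entropy = np.asarray(entropy, dtype=float)
    return np.flatnonzero(entropy > np.median(entropy))

# Toy example: 3 candidate genes x 4 cells, two clusters
expr = np.array([[2, 2, 2, 2],   # evenly expressed across clusters
                 [4, 4, 0, 0],   # restricted to one cluster
                 [3, 3, 1, 1]])  # skewed toward one cluster
labels = np.array([0, 0, 1, 1])
H = shannon_entropy(cluster_means(expr, labels))
flagged = sensitive_genes(H)
```

Under these assumptions, a gene expressed evenly across clusters has the highest entropy and a gene confined to a single cluster has zero entropy, which is consistent with treating high-entropy genes as uninformative for separating clusters.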

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
