RNA-Seq data analysis

Xiao Shi; Yaoting Sun; Cenkai Shen; Yan Zhang; Rongliang Shi; Fan Zhang; Tian Liao; Guojun Lv; Zhengcai Zhu; Lianghe Jiao; Peng Li; Tiansheng Xu; Ning Qu; Naisi Huang; Jiaqian Hu; Tingting Zhang; Yanzi Gu; Guangqi Qin; Haixia Guan; Weilin Pu; Yuan Li; Xiang Geng; Yan Zhang; Tongzhen Chen; Shenglin Huang; Zhikang Zhang; Shuting Ge; Wu Wang; Weibo Xu; Pengcheng Yu; Zhongwu Lu; Yulong Wang; Liang Guo; Yu Wang; Tiannan Guo; Qinghai Ji; Wenjun Wei

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

RNA-Seq data analysis

XS Xiao Shi

YS Yaoting Sun

CS Cenkai Shen

YZ Yan Zhang

RS Rongliang Shi

FZ Fan Zhang

TL Tian Liao

GL Guojun Lv

ZZ Zhengcai Zhu

LJ Lianghe Jiao

PL Peng Li

TX Tiansheng Xu

NQ Ning Qu

NH Naisi Huang

JH Jiaqian Hu

TZ Tingting Zhang

YG Yanzi Gu

GQ Guangqi Qin

HG Haixia Guan

WP Weilin Pu

YL Yuan Li

XG Xiang Geng

YZ Yan Zhang

TC Tongzhen Chen

SH Shenglin Huang

ZZ Zhikang Zhang

SG Shuting Ge

WW Wu Wang

WX Weibo Xu

PY Pengcheng Yu

ZL Zhongwu Lu

YW Yulong Wang

LG Liang Guo

YW Yu Wang

TG Tiannan Guo

QJ Qinghai Ji

WW Wenjun Wei

This method is extracted from research article: Cell Discov, Nov 2022

Integrated proteogenomic characterization of medullary thyroid carcinoma

DOI: 10.1038/s41421-022-00479-y

Request a Protocol

Ask a question

Favorite

The raw data from 101 tumor samples were then demultiplexed and converted into FASTQ format. Sequencing reads were checked for quality and adaptor/primer sequence contamination first by FastQC (version 0.11.5). Before filtering, the proportion of sequenced bases with ≥ Q30 value ranged from 90.66% to 96.03% (average: 93.63%), while GC-content ranged from 45.22% to 62.33% (average: 49.42%). The raw reads were then filtered by Trimmomatic (version 0.36) to remove low-quality reads in the following order: (1) remove adaptor sequences; (2) remove bases with an average quality < 3 at the 5’ or 3’ ends; (3) remove bases with an average quality < 15 in a 4-base sliding window from the 5’ end; (4) after trimming, reads with length < 50 bp were dropped out. After these steps, an average of 80.69 M clean reads for each sample were retained for subsequent analyses. Subsequently, clean reads of RNA-Seq data were aligned to the reference genome GRCh38/hg38.p12 (patch release 12) using HISAT2 (version 2.1.0)⁸⁹, and the count number of each gene was obtained by HTSeq-count (version 0.11.2)⁹⁰. Then the Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) value was calculated using the TopHat-Cufflinks pipeline based on the count number⁹¹. The differences in mRNA abundance were calculated by DESeq2. If multiple testing was performed, P-values were adjusted using the Benjamini–Hochberg procedure.

After quality control, RNA-Seq resulted in an average of 80.76 M clean reads per sample. RNA-seq data analysis identified a total of 19,598 protein-coding genes with an average of 16,742 genes per sample. The distribution of gene expression was largely consistent across the 101 samples subjected to RNA-Seq analysis, suggesting a high degree of stability of our sequencing platform and homogeneity in the transcriptomic profiling of MTC tumors (Supplementary Fig. S2e).

For RNA-Seq variant calling, all reads that passed the quality control (QC) were first aligned using BWA (version 0.7.17, http://bio-bwa.sourceforge.net). SAMtools (version 1.9, http://www.htslib.org) was used to sort the above results and prepare the aligned bam files. Afterwards, variant calling was conducted with BCFtools (version 1.9, http://www.htslib.org) and all obtained variants were annotated with snpEff (version 3.4p, http://pcingola.github.io/SnpEff/). During the whole process, highly accurate variants were filtered with parameters: DP > 10, QUAL > 40, and MQ (RMS Mapping Quality) > 20.

Tumor purity was predicted with the ESTIMATE package¹² in RStudio using FPKM values of the normalized RNA-Seq data.

According to a previous study¹⁵, E2F hallmark gene sets were queried from the MSigDB Hallmark gene sets⁹². For these genes, ssGSEA was conducted for tumor samples using FPKM value of RNA-Seq data via the GSVA R package⁹³. Prior to data normalization, read counts of 0 were treated as NAs. Enrichment scores were calculated from ssGSEA and were regarded as the E2F pathway activity score.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol