RNA-Seq data analysis

XS Xiao Shi
YS Yaoting Sun
CS Cenkai Shen
YZ Yan Zhang
RS Rongliang Shi
FZ Fan Zhang
TL Tian Liao
GL Guojun Lv
ZZ Zhengcai Zhu
LJ Lianghe Jiao
PL Peng Li
TX Tiansheng Xu
NQ Ning Qu
NH Naisi Huang
JH Jiaqian Hu
TZ Tingting Zhang
YG Yanzi Gu
GQ Guangqi Qin
HG Haixia Guan
WP Weilin Pu
YL Yuan Li
XG Xiang Geng
YZ Yan Zhang
TC Tongzhen Chen
SH Shenglin Huang
ZZ Zhikang Zhang
SG Shuting Ge
WW Wu Wang
WX Weibo Xu
PY Pengcheng Yu
ZL Zhongwu Lu
YW Yulong Wang
LG Liang Guo
YW Yu Wang
TG Tiannan Guo
QJ Qinghai Ji
WW Wenjun Wei
request Request a Protocol
ask Ask a question
Favorite

The raw data from 101 tumor samples were then demultiplexed and converted into FASTQ format. Sequencing reads were checked for quality and adaptor/primer sequence contamination first by FastQC (version 0.11.5). Before filtering, the proportion of sequenced bases with ≥ Q30 value ranged from 90.66% to 96.03% (average: 93.63%), while GC-content ranged from 45.22% to 62.33% (average: 49.42%). The raw reads were then filtered by Trimmomatic (version 0.36) to remove low-quality reads in the following order: (1) remove adaptor sequences; (2) remove bases with an average quality < 3 at the 5’ or 3’ ends; (3) remove bases with an average quality < 15 in a 4-base sliding window from the 5’ end; (4) after trimming, reads with length < 50 bp were dropped out. After these steps, an average of 80.69 M clean reads for each sample were retained for subsequent analyses. Subsequently, clean reads of RNA-Seq data were aligned to the reference genome GRCh38/hg38.p12 (patch release 12) using HISAT2 (version 2.1.0)89, and the count number of each gene was obtained by HTSeq-count (version 0.11.2)90. Then the Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) value was calculated using the TopHat-Cufflinks pipeline based on the count number91. The differences in mRNA abundance were calculated by DESeq2. If multiple testing was performed, P-values were adjusted using the Benjamini–Hochberg procedure.

After quality control, RNA-Seq resulted in an average of 80.76 M clean reads per sample. RNA-seq data analysis identified a total of 19,598 protein-coding genes with an average of 16,742 genes per sample. The distribution of gene expression was largely consistent across the 101 samples subjected to RNA-Seq analysis, suggesting a high degree of stability of our sequencing platform and homogeneity in the transcriptomic profiling of MTC tumors (Supplementary Fig. S2e).

For RNA-Seq variant calling, all reads that passed the quality control (QC) were first aligned using BWA (version 0.7.17, http://bio-bwa.sourceforge.net). SAMtools (version 1.9, http://www.htslib.org) was used to sort the above results and prepare the aligned bam files. Afterwards, variant calling was conducted with BCFtools (version 1.9, http://www.htslib.org) and all obtained variants were annotated with snpEff (version 3.4p, http://pcingola.github.io/SnpEff/). During the whole process, highly accurate variants were filtered with parameters: DP > 10, QUAL > 40, and MQ (RMS Mapping Quality) > 20.

Tumor purity was predicted with the ESTIMATE package12 in RStudio using FPKM values of the normalized RNA-Seq data.

According to a previous study15, E2F hallmark gene sets were queried from the MSigDB Hallmark gene sets92. For these genes, ssGSEA was conducted for tumor samples using FPKM value of RNA-Seq data via the GSVA R package93. Prior to data normalization, read counts of 0 were treated as NAs. Enrichment scores were calculated from ssGSEA and were regarded as the E2F pathway activity score.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A