The raw data from 101 tumor samples were then demultiplexed and converted into FASTQ format. Sequencing reads were checked for quality and adaptor/primer sequence contamination first by FastQC (version 0.11.5). Before filtering, the proportion of sequenced bases with ≥ Q30 value ranged from 90.66% to 96.03% (average: 93.63%), while GC-content ranged from 45.22% to 62.33% (average: 49.42%). The raw reads were then filtered by Trimmomatic (version 0.36) to remove low-quality reads in the following order: (1) remove adaptor sequences; (2) remove bases with an average quality < 3 at the 5’ or 3’ ends; (3) remove bases with an average quality < 15 in a 4-base sliding window from the 5’ end; (4) after trimming, reads with length < 50 bp were dropped out. After these steps, an average of 80.69 M clean reads for each sample were retained for subsequent analyses. Subsequently, clean reads of RNA-Seq data were aligned to the reference genome GRCh38/hg38.p12 (patch release 12) using HISAT2 (version 2.1.0)89, and the count number of each gene was obtained by HTSeq-count (version 0.11.2)90. Then the Fragments Per Kilobase of exon model per Million mapped fragments (FPKM) value was calculated using the TopHat-Cufflinks pipeline based on the count number91. The differences in mRNA abundance were calculated by DESeq2. If multiple testing was performed, P-values were adjusted using the Benjamini–Hochberg procedure.
After quality control, RNA-Seq resulted in an average of 80.76 M clean reads per sample. RNA-seq data analysis identified a total of 19,598 protein-coding genes with an average of 16,742 genes per sample. The distribution of gene expression was largely consistent across the 101 samples subjected to RNA-Seq analysis, suggesting a high degree of stability of our sequencing platform and homogeneity in the transcriptomic profiling of MTC tumors (Supplementary Fig. S2e).
For RNA-Seq variant calling, all reads that passed the quality control (QC) were first aligned using BWA (version 0.7.17, http://bio-bwa.sourceforge.net). SAMtools (version 1.9, http://www.htslib.org) was used to sort the above results and prepare the aligned bam files. Afterwards, variant calling was conducted with BCFtools (version 1.9, http://www.htslib.org) and all obtained variants were annotated with snpEff (version 3.4p, http://pcingola.github.io/SnpEff/). During the whole process, highly accurate variants were filtered with parameters: DP > 10, QUAL > 40, and MQ (RMS Mapping Quality) > 20.
Tumor purity was predicted with the ESTIMATE package12 in RStudio using FPKM values of the normalized RNA-Seq data.
According to a previous study15, E2F hallmark gene sets were queried from the MSigDB Hallmark gene sets92. For these genes, ssGSEA was conducted for tumor samples using FPKM value of RNA-Seq data via the GSVA R package93. Prior to data normalization, read counts of 0 were treated as NAs. Enrichment scores were calculated from ssGSEA and were regarded as the E2F pathway activity score.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.