WGS, WES, and RNA-Seq analysis

JS Jason R. Schwartz
JM Jing Ma
JK Jennifer Kamens
TW Tamara Westover
MW Michael P. Walsh
SB Samuel W. Brady
JM J. Robert Michael
XC Xiaolong Chen
LM Lindsey Montefiori
GS Guangchun Song
GW Gang Wu
HW Huiyun Wu
CB Cristyn Branstetter
RH Ryan Hiltenbrand
MW Michael F. Walsh
KN Kim E. Nichols
JM Jamie L. Maciaszek
YL Yanling Liu
PK Priyadarshini Kumar
JE John Easton
SN Scott Newman
JR Jeffrey E. Rubnitz
CM Charles G. Mullighan
SP Stanley Pounds
JZ Jinghui Zhang
TG Tanja Gruber
XM Xiaotu Ma
JK Jeffery M. Klco
request Request a Protocol
ask Ask a question
Favorite

DNA and RNA material was isolated from bulk myeloid or isolated lymphocytes by standard phenol:chloroform extraction and ethanol precipitation. Whole genome sequencing libraries were constructed using the TruSeq DNA PCR-Free sample preparation kit (Illumina, Inc., CA) following the manufacturer’s instructions and whole-exome sequencing was completed using the Nextera Rapid Capture Expanded Exome reagent (Illumina). After library quality and quantity assessment, WGS, WES, or RNASeq samples were sequenced on various Illumina platforms (HiSeq 2500, HiSeq 4000, or NovaSeq 6000). Mapping, coverage, quality assessment, single-nucleotide variant (SNV) and indel detection, and tier annotation for sequence mutations (SNVs discovered by WGS were classified as tier 1, tier 2, tier 3, or tier 4) have been described previously5557 and briefly described here. DNA reads were mapped using BWA58,59 (WGS: v0.7.15-r1140; WES: v0.5.9-r26-dev and v0.7.12-r1039 since data were generated over a period of time) to the GRCh37/hg19 human genome assembly. Aligned files were merged, sorted and de-duplicated using Picard tools 1.65 (broadinstitute.github.io/picard/). SNVs and Indels in WGS and WES were detected using Bambino60. For WGS data, sequence variants were classified into the following four tiers: (i) tier 1: coding synonymous, nonsynonymous, splice-site and noncoding RNA variants; (ii) tier 2: conserved variants (conservation score cutoff of greater than or equal to 500, based on either the phastConsElements28way table or the phastConsElements17way table from the UCSC Genome Browser) and variants in regulatory regions annotated by UCSC (regulatory annotations included are targetScanS, ORegAnno, tfbsConsSites, vistaEnhancers, eponine, firstEF, L1 TAF1 Valid, Poly(A), switchDbTss, encodeUViennaRnaz, laminB1 and cpgIslandExt); (iii) tier 3: variants in non-repeat masked regions; and (iv) tier 4: the remaining SNVs. Structural variations in whole-genome sequencing data were analyzed using CREST61 (v1.0). RNA-sequencing was performed using TruSeq Stranded Total RNA library kit (Illumina) and analyzed, as previously described16,17. Briefly, RNA reads were mapped using our StrongARM pipeline (internal pipeline, described by Wu et al.62). Paired-end reads from RNA-seq were aligned to the following four database files using BWA: (i) the human GRCh37-lite reference sequence, (ii) RefSeq, (iii) a sequence file representing all possible combinations of non-sequential pairs in RefSeq exons and, (iv) the AceView database flat file downloaded from UCSC representing transcripts constructed from human ESTs. Additionally, they were mapped to the human GRCh37-lite reference sequence using STAR. The mapping results from databases (ii)–(iv) were aligned to human reference genome coordinates. The final BAM file was constructed by selecting the best of the five alignments. Chimeric fusion detection was carried out using CICERO63 (v0.3.0) and Chimerascan64 (v0.4.5). All identified fusions were validated by either RT-PCR, cytogenetics, manual review of CREST data, or a combination of these methods (Supplementary Data 18, 20, & Supplementary Figs. 9 and 18). Mapping statistics and coverage data are described in Supplementary Data 68 & 15. Recurrent SNV’s identified by WGS or WES were validated by custom capture resequencing (Supplementary Data 2, 3, and 19). Custom capture baits were designed (Twist Biosciences) to be 80 nucleotides long covering the provided hg19 target region consisting of 1,006,633 unique base pairs (bp). A total target region of 904,622 bp is directly covered by 11,455 probes. BWA58,59 (v0.7.12) MEM algorithm was used to map the TWIST sequencing reads to the GRCh37/hg19 human genome assembly. Rsamtools65 (v1.30.0) was used to retrieve read counts from BAM files for the SNV/Indels called in WES, requiring MAPQ > = 1 and base quality Phred score > = 20. We also performed de novo mutation calling in an attempt to catch canonical low variant allele frequency (VAF) cancer gene mutations missed by WES using VarScan 266 (v2.3.5) on the TWIST data with the following criteria: MAPQ > = 1; base quality Phred score > = 20; VAF > = 0.01 and variant call p-value < = 0.05. Selected somatic variants (WES read count <5 and targeted capture read count <10) and all somatic TP53 variants identified via WES were validated by custom amplicon sequencing. PCR primers (Supplementary Data 22) were designed to flank the putative variants. Amplicon sizes were approximately 200 base pairs. PCR was performed using KAPA HiFi HotStart ReadyMix (Roche), 100 nM of each primer (IDT) and 20 ng of gDNA in a 40uL reaction volume. Thermocycling was performed using the following parameters: 95 °C for 3 min; 98 °C for 20 s, 62 °C for 15 s, and 72 °C for 15 s for a total of 30 cycles; and 72 °C for 1 min. All amplicons were quality checked on a 2% agarose gel. Primers were designed to incorporate Illumina overhang adapter sequences which allowed for indexing using the Nextera XT Index kit (Illumina) following the manufacturer’s instructions. Libraries were normalized, pooled, and sequenced on an Illumina MiSeq instrument using a 2 × 150 paired-end version 2 sequencing kit. We used the CleanDeepSeq52 approach with default settings for error suppression in this ultra-deep amplicon sequencing.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A