2.2. Next-Generation Sequencing and Deep Transcriptome Analysis

Caralina Marín de Evsikova, Isaac D. Raplee, John Lockhart, Gilberto Jaimes, Alexei V. Evsikov

Second generation sequencing techniques emerged in 2005 (Figure 2), and the equipment fundamentally differs from first generation sequencers because multiple different DNA molecules are sequenced concurrently. As a result, tens of thousands to hundreds of millions of individual sequencing reads are produced with each run. Different principles of sequencing and detection, and different chemistries behind the various platforms, lead to large differences in read length, base call accuracy, and total number of output reads. The largest obstacle for second generation sequencers is achieving read length-to-quality ratios comparable to Sanger sequencing, with most platforms producing average read lengths under 300 bases. In addition, samples are sequenced in a stop-read-start manner that leads to lengthy processing times, with some platforms requiring over a week for a single run to complete. To make these platforms economical, the number of reads per run has been increased through the introduction of larger machines, such as the Illumina HiSeq series, or denser chips, in the case of Ion Torrent. However, the larger sequencers carry a substantially higher price and require processing at full capacity to benefit from the increased throughput; consequently, they are not typically found in individual laboratories or small research consortia. Smaller platforms available from Illumina, 454 Roche, and Ion Torrent produce longer reads than the larger sequencers and thereby suit the needs of small research consortia and well-funded laboratories [33].

All second generation sequencing platforms require modification and amplification of sample DNA. Samples are fragmented and adapters are annealed to the ends. For platforms that use emulsion PCR (emPCR) to amplify the samples, the adapters allow the fragments to bind to complementary bases on the emulsion beads. SOLiD sequencing further modifies the fragments after amplification by adding regions that allow the fragments to bond covalently to the sequencer slide. The Illumina platform uses bridge PCR to amplify the samples, which have been modified with adapters that base pair with oligonucleotides embedded on the sequencer slide.

Each platform also employs a different method for generating the base calls for each sample, but only Ion Torrent does not use a light-based recording method. The base calls are reported by pyrosequencing (Figure 4) in 454 Roche platforms, and by fluorescent tag cleavage in Illumina and SOLiD platforms. The Illumina platform produces forward and reverse reads from each DNA fragment and SOLiD identifies each fragment’s bases twice, thereby increasing accuracy. Ion Torrent uses a microchip with pH meters incorporated into each well to detect the release of an H+ ion with each base incorporated.

Figure 4. Principle of pyrosequencing.

Extension of fragments occurs during sequential “flooding” of the sequencing reaction chamber with solutions containing specific nucleotides. Illumina differs from other platforms by using a reaction mixture containing all four nucleotides. The Illumina nucleotides are modified with a fluorescent group plus a reversible terminator that prevents incorporation of additional bases within a cycle. The fluorescence is recorded and the tag cleaved before the sequencer is flooded with the nucleotide-containing reaction mixture again. In pyrosequencing (Figure 4), the pyrophosphate released upon each nucleotide incorporation drives a luciferase reaction whose light is recorded. SOLiD sequencing uses fluorescently tagged di-base probes extended by a 3-base degenerate region; primers offset by one base are used in successive rounds of ligation, so that after five rounds each incorporated nucleotide has been interrogated twice. Nucleotides in Ion Torrent sequencers are added in alternating “floods” of A, T, C, and G. As each base is paired to the fragment, an H+ ion is released and detected by the sequencer microchip.
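The per-flood signal logic described above can be sketched as a short simulation (a hypothetical illustration, not any vendor's base-calling software). Note how a homopolymer run yields a single large signal within one flood, which is why homopolymer lengths are difficult to call accurately:

```python
# Hypothetical sketch of flood-based sequencing-by-synthesis signals,
# as in pyrosequencing: light intensity in each flood is proportional
# to the number of identical bases incorporated at once.

def pyrosequencing_signals(template, flow_order="TACG", n_flows=16):
    """Return (nucleotide, signal) per flood for a template strand.

    `template` is the strand being synthesized (5'->3'); each flood
    extends the read for as long as the flowed nucleotide matches.
    """
    signals = []
    pos = 0
    for i in range(n_flows):
        nt = flow_order[i % len(flow_order)]
        run = 0
        while pos < len(template) and template[pos] == nt:
            run += 1   # each incorporation releases pyrophosphate ->
            pos += 1   # luciferase light, summed within the flood
        signals.append((nt, run))
    return signals

# The TTT homopolymer produces one signal of intensity 3:
sig = pyrosequencing_signals("TTTACGG")
print(sig[:4])  # [('T', 3), ('A', 1), ('C', 1), ('G', 2)]
```

Distinguishing a signal of intensity 7 from 8 is far harder than distinguishing 0 from 1, which is the root of the homopolymer error mode discussed later in this section.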

The recent ability to interrogate the transcriptome of individual cells using second generation sequencers has revealed heterogeneity in gene expression among individual cells within a population. As the name implies, single-cell RNA sequencing (scRNA-seq) relies on the isolation and amplification of transcriptomes from individual cells, and many different isolation and amplification strategies have been developed, such as CEL-seq2 [34], Smart-seq2 [35], and Drop-seq [36]. Isolation of individual cells is accomplished using microfluidic capture chips (CEL-seq2), fluorescence-activated cell sorting (Smart-seq2), or droplet emulsion (Drop-seq). Most scRNA-seq protocols, excluding Smart-seq, incorporate cell-specific barcodes during the reverse transcription reaction, which allows large-scale multiplexing. Smart-seq, in contrast to other scRNA-seq methods, generates full-length cDNA and can more accurately differentiate between splice variants. A side-by-side comparison of these scRNA-seq strategies found that Drop-seq was the most cost-effective method, whereas Smart-seq was the most accurate [37]. Analyzed cells may be clustered based on expression levels of selected genes, either to detect changes in cell populations or changes within a population induced by disease. This strategy to separate and sequence by cell type was recently used to analyze normal and atherosclerotic aortas from mice and detected a previously unreported population of macrophages in diseased aortas that expressed high levels of the triggering receptor expressed on myeloid cells 2 (Trem2) gene [38].
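Barcode-based multiplexing can be illustrated with a minimal demultiplexing sketch. The 12-base cell barcode plus 8-base unique molecular identifier (UMI) read layout is a Drop-seq-like assumption, and the read strings and whitelist below are invented for illustration; production pipelines additionally correct sequencing errors in barcodes:

```python
# Hypothetical sketch: grouping scRNA-seq reads by cell barcode.
# Assumed read 1 layout: [12 bp cell barcode][8 bp UMI][cDNA sequence].

from collections import defaultdict

BC_LEN, UMI_LEN = 12, 8

def demultiplex(reads, whitelist):
    """Group (UMI, cDNA) fragments by cell barcode."""
    cells = defaultdict(list)
    for read in reads:
        barcode = read[:BC_LEN]
        if barcode in whitelist:  # discard reads with unknown barcodes
            umi = read[BC_LEN:BC_LEN + UMI_LEN]
            cells[barcode].append((umi, read[BC_LEN + UMI_LEN:]))
    return cells

reads = [
    "AAAACCCCGGGT" + "ACGTACGT" + "TTGGCCAA",  # cell 1
    "AAAACCCCGGGT" + "ACGTACGA" + "TTGGCCAA",  # cell 1, different UMI
    "TTTTGGGGCCCA" + "AAAAAAAA" + "GATTACAG",  # cell 2
]
cells = demultiplex(reads, whitelist={"AAAACCCCGGGT", "TTTTGGGGCCCA"})
print({bc: len(v) for bc, v in cells.items()})
```

Because the barcode is attached during reverse transcription, all downstream steps (pooling, amplification, sequencing) can be performed on the mixture, with cell identity recovered computationally as above.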

Next generation sequencers are powerful tools, but they are not without flaws, and errors can arise at any step of the sequencing process. Firstly, errors may be introduced by the polymerase during amplification of sample cDNA, and research indicates this may be the primary source of errors in second generation sequencing data [39]. Secondly, errors originate from the chemistries used by the various platforms and often manifest as nucleotide substitutions, insertions, or deletions [33]. Error rates of second generation sequencers are principally increased in homopolymeric regions, where multiple identical bases are incorporated in a single cycle. AT-enriched regions and genomes also increase error rates, possibly owing to PCR artifacts and nonrandom fragmentation of sample DNA [40]; errors due to AT-richness are most pronounced on the Ion Torrent platforms [41]. Furthermore, in single-cell sequencing studies, comparisons can be greatly impaired by poor matching of samples and disease stages, and inter-individual variability can compound the inherent heterogeneity present when comparing individual cells. While the ability to determine the response and contribution of individual cell types to disease progression is important, more samples are necessary to identify and distinguish between inter-individual and intra-individual variation.

For next-generation RNAseq analysis, the most important parameters to consider in experimental design, in order to substantially increase the quality of downstream analysis, are: the number of biological replicates; the depth of sequencing (i.e., the number of reads produced for each sample); read length; single-end vs. paired-end sequencing (i.e., whether each sequenced DNA molecule is represented by a read from one strand or by two reads, one from each strand); and the RNA extraction method. Under budgetary constraints, tradeoffs between sequencing depth and the number of biological replicates are often made. As consistently reported, a sufficient number of biological replicates (n = 3–4) is more critical for robust, reliable, and replicable analysis than sequencing depth [42,43,44,45]. As technologies improve, sequence lengths increase. For differential expression, little difference is seen if the read length is >25 bp, in either single-end or paired-end sequencing. However, for greater accuracy in transcript identification and splice junction detection, reads should be paired-end and ≥100 bp [46]. The RNA extraction method impacts the ratio of RNAs present during sequencing, and a specific strategy should be chosen with the biological or biomedical question of interest in mind. For example, total RNA extraction is useful for capturing unique transcriptome features, such as noncoding RNA. However, ribosomal RNA (rRNA) comprises >90% of total RNA and should be depleted if noncoding, non-ribosomal RNA is to be assessed. Current techniques cannot completely remove rRNA, and ~2%–35% residual remains in the sample. Therefore, greater sequencing depth should be considered when using ribosomal depletion methods, to counter the abundance of rRNA and improve detection of other transcripts. In eukaryotic organisms, if only protein-coding genes are of interest, poly(A) selection yields greater accuracy of transcript quantification [47].
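The rRNA-residual figures above translate into simple depth arithmetic; a minimal sketch, assuming a target number of usable (non-rRNA) reads per sample (the 30 M target below is an arbitrary example, not a recommendation from the text):

```python
# Depth arithmetic implied above: if ribosomal depletion leaves a
# residual fraction of rRNA reads, total sequencing depth must be
# scaled up so that the *usable* (non-rRNA) reads still hit the target.

def required_depth(target_usable_reads, rrna_residual):
    """Total reads needed so that non-rRNA reads reach the target."""
    if not 0 <= rrna_residual < 1:
        raise ValueError("residual fraction must be in [0, 1)")
    return int(round(target_usable_reads / (1 - rrna_residual)))

# 30 M usable reads with the ~2% vs. ~35% residuals cited above:
print(required_depth(30_000_000, 0.02))  # 30612245 (~30.6 M total)
print(required_depth(30_000_000, 0.35))  # 46153846 (~46.2 M total)
```

At the high end of the residual range, roughly half again as many total reads are needed, which is the quantitative reason the text recommends greater depth with ribosomal depletion protocols.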
These issues are particularly critical for clinical samples from patients, which are routinely processed as formalin-fixed, paraffin-embedded (FFPE) samples; fixation degrades RNA quality and adversely affects downstream alignment, including spurious mapping to pseudogenes [48]. Fortuitously, side-by-side comparisons of FFPE and flash-frozen samples show a high degree of concordance (e.g., r2 in the range of 0.90–0.97 in recent studies [49,50]), demonstrating that RNAseq is a viable tool for gene quantification in clinical settings. Controls, depending upon availability, need to be non-diseased tissue, either from the same patient or from another individual without the disease [51]. In addition, given that atherosclerosis is a common disease, patients come from genetically diverse, heterogeneous populations with variable symptomology, which requires more samples to detect meaningful changes in the transcriptome that truly reflect the disease process. However, in other diseases, such as breast cancer, as few as n = 9–10 patient samples (plus samples from healthy controls) have been ample to detect specific alleles and molecular pathways [51].

Despite the errors that may occur when using second generation sequencers, several advantages over the original transcriptome technologies, such as Sanger sequencing, EST, and SAGE (Figure 1, Section 2.1), warrant their use experimentally and clinically. First, second generation sequencers offer orders of magnitude deeper coverage of sample RNA than Sanger sequencing of EST libraries achieves, yielding faster discovery and more accurate analysis of an entire transcriptome. Also, the length and quality of sequence produced by second generation sequencers are much better than the fragments produced in SAGE, which improves transcriptome accuracy. While EST sequencing typically produced fragments of at least 500 bp, most second generation sequencing produces shorter reads, although read length can be increased at the expense of read depth. Next generation sequencers have advantages over microarrays because essentially all expressed transcripts and their variants can be detected, without restriction to the probes present on the microarray chip or beads [52]; in addition, the ability to barcode different samples or conditions within a single sequencing procedure permits multiplexing of samples.
