4.1. Genome Sequencing

Hee Min Yoo
Il-Hwan Kim
Seil Kim

SARS-CoV-2 strains have been sequenced on an unprecedented scale. Approximately one million genomes have been sequenced to date [30]. As the number of genomes has increased rapidly, the World Health Organization (WHO) has provided guidelines for the genome sequencing of SARS-CoV-2 [159]. According to these guidelines, genome sequencing of SARS-CoV-2 can be used for understanding the emergence and the biology of the virus, improving diagnostics and therapeutics, investigating virus transmission and spread, and inferring epidemiological parameters [159]. The accumulated genome sequences have made it possible to identify the emergence of the variants of concern [43,44,48,49,57,62], to investigate the origin of SARS-CoV-2 [2,31], and to determine the mutation frequency of RT-qPCR primer/probe sites [30]. Various whole-genome sequencing methods for the virus are currently being developed [160,161,162,163,164,165,166]. Viral sequencing methods can be categorized into the metagenomics approach and target enrichment-based methods. In the metagenomics approach, nucleic acids are extracted from clinical samples and sequenced directly. This approach has a clear advantage over target enrichment-based methods: it can be used even when no information about the pathogen is available or when the pathogen is novel and previously unknown. However, a high proportion of the extracted material is host-cell genetic material, which should be removed or reduced before sequencing. The methods for removing or depleting host genetic material vary with the type of sample and the virus [167,168,169,170,171,172,173,174]. Because of the nature of the metagenomics approach, clinical samples should ideally have a high titer of the pathogen. The metagenomics approach can also be applied to cultured pathogens. 
However, isolating and culturing a pathogen is time-consuming and labor-intensive, and for some pathogens it is very difficult or not possible at all [175,176]. Alternatively, target enrichment methods can be used for genome sequencing. The genetic material of specific pathogens can be enriched with hybrid capture probes [177,178]. The probe sequences are complementary to the genome sequences of the specific pathogens, so this enrichment effectively removes non-target sequences and increases the proportion of target sequences. One advantage of hybrid capture approaches is their tolerance of sequence mismatches, which allows divergent variants to be captured. However, hybrid capture approaches are relatively more expensive and more complicated than other approaches. Another group of target-specific enrichment approaches is the amplicon-based approaches, which depend mainly on PCR. PCR can selectively enrich the genome of the target pathogen in the presence of non-target nucleic acids such as host genetic material. Owing to the nature of PCR, amplicon-based approaches are relatively inexpensive, sensitive, and specific compared with other approaches. The WHO guidelines suggest that complete genome sequencing can be done from samples with Ct values of up to 30 and partial genome sequencing from samples with Ct values of 30–35, although Ct values can vary with various factors [159]. Currently, the most widely used primer panels for SARS-CoV-2 are the ARTIC network amplicon sets [179]. At least three commercially available SARS-CoV-2 primer panels are based on the ARTIC network amplicon sets: the CleanPlex SARS-CoV-2 Panel (Paragon Genomics), the QIAseq SARS-CoV-2 Primer Panel (Qiagen), and the NEBNext ARTIC SARS-CoV-2 Library Prep Kit (NEB). However, amplicon-based approaches also have limitations. 
Primer design requires prior knowledge of the full genome sequence. In addition, the intolerance of primers to sequence mismatches hinders the genome sequencing of variants. It therefore appears that amplicon-based approaches can be applied only to pathogens that are already well characterized. Alternatively, genomic material can be amplified in a sequence-independent manner [163,180,181], for example by single primer isothermal amplification (SPIA). Because SPIA amplifies genomic material regardless of its sequence, it does not require the prior knowledge of the pathogen that a target enrichment approach needs. However, SPIA also amplifies non-target genomic material, so removing non-target material is mandatory for SPIA-based sequencing. For the same reason, a high proportion of target genomic material in the sample is crucial for successful SPIA-based sequencing. SPIA-based sequencing with low viral input showed very low coverage compared with other methods [180].
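
The Ct thresholds quoted above from the WHO guidelines for amplicon-based sequencing can be written as a small triage rule. The following is only a minimal sketch: the function name, the sample names, and the Ct values are hypothetical, and the label for Ct values above 35 is an assumption, since the text gives no guidance for that range. As noted above, Ct values vary with various factors, so the cutoffs are indicative rather than strict.

```python
def expected_sequencing_outcome(ct: float) -> str:
    """Map an RT-qPCR Ct value to the outcome suggested by the quoted thresholds.

    Ct <= 30        -> complete genome sequencing is expected to be feasible
    30 < Ct <= 35   -> partial genome sequencing is expected
    Ct > 35         -> not covered by the quoted thresholds (assumed label)
    """
    if ct < 0:
        raise ValueError("Ct value cannot be negative")
    if ct <= 30:
        return "complete genome"
    if ct <= 35:
        return "partial genome"
    return "outside quoted range"


# Hypothetical samples, for illustration only.
samples = {"sample_A": 22.4, "sample_B": 31.8, "sample_C": 37.0}
for name, ct in samples.items():
    print(name, ct, expected_sequencing_outcome(ct))
```

A rule like this could be used to decide which samples to include in an amplicon sequencing run, but any real cutoff would have to be validated for the assay and laboratory in question.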

Various sequencing technologies can be used for the genome sequencing of pathogens. Although conventional Sanger sequencing can still be used for viral genome sequencing [182], most SARS-CoV-2 genome sequencing is done with next-generation sequencing (NGS) technology. Currently, the most widely used NGS platforms are those of Illumina. Although individual reads are relatively short (paired-end 150 bp), the throughput of the platform and the accuracy of individual reads are outstanding. Ion Torrent is another short-read sequencing platform, with individual read lengths of 400 bp or 600 bp. The run time of the Ion Torrent sequencer is shorter than that of the Illumina sequencer. Long-read alternatives are also available. Individual reads from PacBio and Oxford Nanopore Technologies sequencers are tens of kilobase pairs long or more, so individual reads from these long-read sequencers can cover most of the viral genome. However, the throughput of long-read sequencers is relatively lower than that of short-read sequencers such as the Illumina platforms, and the accuracy of individual reads is also relatively lower. The Oxford Nanopore Technologies platforms show both the benefits and the drawbacks of long-read sequencing in their most pronounced form. Individual reads of up to megabase scale have been recorded [183]. However, because of the relatively low accuracy of individual reads, the WHO guidelines do not recommend these platforms for SARS-CoV-2 genome sequencing unless the sequencing is replicated [159,184]. The coverage and depth of viral genome sequencing vary with the number of samples in a single run. Generally, most multiplex sequencing library kits for NGS support up to 384 samples per run. 
However, the production-scale Illumina sequencers (HiSeq, NextSeq, and NovaSeq) generate a massive number of reads for the small genomes of viruses, even with multiplexed libraries. Although short reads require relatively more depth to achieve high coverage, the massive number of reads and the low error rate of individual reads can compensate for the short read length. Because of this massive read output, the metagenomics approach to viral genome sequencing is practical only on Illumina sequencers or similar platforms. Even if most of the reads are non-target reads (host genetic material, contamination, etc.), a high-quality genome assembly of the virus can be produced from the small remaining fraction of target reads. Long-read sequencers such as those from PacBio and Oxford Nanopore Technologies can generate very long individual reads that cover most of the viral genome. However, because of their relatively low total read yield and the low accuracy of individual reads, long-read sequencers are not adequate for the metagenomics approach to viral genome sequencing. Instead, they are more suitable for the amplicon-based approach. Target-specific amplification can overcome the drawbacks of long-read sequencers by offsetting the low total read yield and the low accuracy of individual reads. Moreover, unlike short-read sequencers, long-read sequencers can use long amplicons. The schematic procedure of genome sequencing is shown in Figure 3.

Figure 3. Schematic procedure of genome sequencing.
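
The relationship described above between read output, the number of multiplexed samples, and the fraction of on-target reads can be made concrete with a back-of-the-envelope depth estimate. The sketch below is illustrative only: the function name and all run parameters (total reads per run, on-target fraction) are hypothetical assumptions, not figures from the text or from any instrument specification. The genome length is that of the Wuhan-Hu-1 reference sequence (29,903 nt). The estimate assumes that reads are spread evenly over the samples and over the genome, which real runs do not achieve.

```python
SARS_COV_2_GENOME_BP = 29_903  # length of the Wuhan-Hu-1 reference genome


def mean_depth(total_reads: float,
               n_samples: int,
               read_length_bp: int,
               on_target_fraction: float,
               genome_bp: int = SARS_COV_2_GENOME_BP) -> float:
    """Estimate mean per-sample sequencing depth over the viral genome.

    depth = (reads per sample * on-target fraction * read length) / genome length
    Assumes uniform distribution of reads across samples and genome positions.
    """
    if n_samples <= 0 or genome_bp <= 0:
        raise ValueError("n_samples and genome_bp must be positive")
    if not 0.0 <= on_target_fraction <= 1.0:
        raise ValueError("on_target_fraction must be between 0 and 1")
    reads_per_sample = total_reads / n_samples
    on_target_bases = reads_per_sample * on_target_fraction * read_length_bp
    return on_target_bases / genome_bp


# Hypothetical run: 400 million 150-bp reads shared by 384 samples,
# with only 1% of the reads coming from the virus (metagenomics-like case).
depth = mean_depth(total_reads=400_000_000, n_samples=384,
                   read_length_bp=150, on_target_fraction=0.01)
print(f"estimated mean depth: {depth:.1f}x")
```

Varying `on_target_fraction` and `total_reads` in this sketch shows why a platform with a lower read yield depends more heavily on target enrichment to reach a usable depth.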
