The OLC approach is based on graph overlaps. Fundamentally, this approach operates in three stages. First, overlaps (O) are found between all the reads. The OLC method then creates the layout (L) of all reads and overlaps the details in the graph. The consensus sequence (C) is then concluded from the multiple sequence alignments(MSA) (Wang & Jiang, 1994; Idury & Waterman, 1995).
An overlap graph represents the sequencing reads and their overlaps are used in the OLC process (Miller, Koren & Sutton, 2010; Koren et al., 2012). Overlaps must be pre-computed, and overlap detection between each pair is explicit, usually by all-against-all pairwise alignment (Altschul et al., 1990; Haque, Aravind & Reddy, 2009; Giegerich & Wheeler, 1996). The overlap graph indicates overlaps between reads with nodes and edges (Myers, 1995). As a result, the OLC method constructs a read graph that places reads as nodes and assigns a relationship between two nodes when these two reads overlap longer than the cutoff length. Paths through the graph are regarded as candidate contigs, and these paths can be translated into a series of genome sequences (He et al., 2013) as illustrated in Fig. 1. This process is further described in the next three steps.
The overlap of each pair of reads is identified using all-against-all pairwise read alignment. K-mers pre-calculation for all reads would improve performance considerably. It selects candidates that share K-mers and measures alignment by using K-mers as alignment seeds. The detection of overlaps is overly sensitive to limited overlap length and the size of the K-mer. The selection of these parameters can therefore significantly influence the assembler’s efficiency. There would be too many candidates for small parameter values, while large values, in comparison, can lead to accurate contiguities that are shorter. Consequently, finding a good balance requires a considerable amount of time.
Based on the overlap information, the OLC constructs an overlap graph. Within this step, the OLC finds a special form of path, i.e., a simple path where every node is distinct. This path is a Hamiltonian path as the nodes will be visited exactly once. However, finding a Hamiltonian path is an NP-hard problem. This problem is solved in practice using a greedy strategy or heuristic algorithms.
Finally, the OLC performs multiple sequence alignment (MSA). MSA is intended to decide the exact layout and voting strategies. Alternatively, it may use statistical methods to define the best consensus sequence. However, no method efficiently resolves the optimal MSA problem. Therefore, the consensus stage uses pairwise alignments driven by the approximate read layout.
Canu is a modern long read sequence assembler that strengthens and replaces the Celera Assembler (Myers et al., 2000; Miller et al., 2008). By integrating the MinHash Alignment Method (MHAP) with the PBcR (Koren, 2012) and Celera Assembler, the computational bottlenecks of overlapping noisy, single-molecule sequencing reads can be addressed (Berlin et al., 2015). Furthermore, Canu integrates these methods into one comprehensive assembler, which supports PacBio and Oxford Nanopore data, reduces the runtime and coverage needs, and improves the separation of repeats and haplotypes. It, therefore, improves the mammalian genomes’ efficiency by some degree and exceeds hybrid processes with as little as 20×single molecule coverage (Koren et al., 2017).
Canu (Koren et al., 2017) introduces several additional features to improve usability and efficiency with single-molecule sequence data, including computational resource discovery, adaptive k-mer weighting, automated error rate estimation, sparse graph construction, and graphical fragment assembly (Li, 2016) outputs. The Canu pipeline comprises three stages, each of which can be corrected, trimmed, and assembled in series or separately (Koren et al., 2017). Canu automatically senses available resources and configures itself to optimize the usage of resources when operating in parallel environments.
FASTA (Lipman & Pearson, 1985) or FASTQ (Cock et al., 2010) data is the primary data exchange,but for consistency, the input stores reads for each stage in an indexable database that no longer needs theoriginal input. Each of the three stages starts with the identification of overlaps between all input pairs. Although each stage has a different overlapping strategy, each one has an indexed store of these overlapsobtained by counting k-mers in the reads.
From the input reads, corrected reads are generated from the correction stage, unsupported bases and other anomalies are trimmed in the trimming stage, and assembly graphs and contigs in the assembly stage are finally built.
HINGE is another OLC assembler that aims to achieve an optimal resolution by differentiating between repeats that can be resolved and those that cannot (Kamath et al., 2017). To do this, hinges are applied to the reads for creating a graph where only unresolvable repeats are combined. Consequently, HINGE blends the error resilience of overlapping graphical methods, the elegant graph structure, and optimal repeat resolution of graphs by de Bruijn.
HINGE is a long read assembler that follows the paradigm of the overlap layout. Its key algorithmic advancement lies in how it uses the alignments achieved in the overlapping process to recognize resolvable repeats and build the structure of the graph repetitively (Pevzner & Tang, 2001; Mulyukov & Pevzner, 2002). To equip some of the reads with hinges, HINGE uses the alignment information collected during the overlap phase. The Contagion algorithm is utilized to disperse the information of how it bridges repeats to other reads (Kamath et al., 2017).
The Contagion algorithm allows HINGE to position precisely one in-and-out hinge on reads that have emerged from an unbridged repeat. Then, with a hinge-assisted greedy graph, HINGE can construct a sparse overlap graph. Given that our reads are hinged, as long as the match starts on the hinge, we also allow a read successor or predecessor to be within another read. Incidentally, the graph forms a bifurcation that corresponds to an unbridged repeat’s beginning or end.
This hinge-aided approach helps us, within the OLC substructure, to achieve the attractive features of a de Bruijn graph.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.