De Bruijn Graph (DBG) method

Firaol Dida; Gangman Yi

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

De Bruijn Graph (DBG) method

FD Firaol Dida

GY Gangman Yi

This method is extracted from research article: PeerJ Comput Sci, Jul 2021

Empirical evaluation of methods for de novo genome assembly

DOI: 10.7717/peerj-cs.636

Request a Protocol

Ask a question

Favorite

The De Bruijn Graph (Flicek & Birney, 2009; Schatz, Delcher & Salzberg, 2010; Miller, Koren & Sutton, 2010; Compeau, Pevzner & Tesler, 2011) method builds the whole-genome sequence of short reads. It utilizes k-mer graphs, suitable for large numbers of short reads. For the k-mer graph, it is no longer practical to consider all-against-all overlap. Each node represents a k-mer if an overlap of k-1 bases occurs and appears continuously in a read, and a directed arc will occur between the two nodes. There is also no need for individual reads and their overlap data to be saved (He et al., 2013). The overlap between adjacent k-mers is implicitly determined by cutting the reads into k-mers and recording their adjacent relationships. To summarize, the DBG approach generates a k-mer graph that considers k-mers to be nodes and assigns an edge to two nodes in the genome sequence where they are neighbors as illustrated in Fig. 2.

Assembly methods based on de Bruijn graphs begin with the replacement of each read with the set of all overlaps of a shorter fixed length (Liao et al., 2019; Chaisson, Wilson & Eichler, 2015). Usually, k denotes the length of the k-mers sequences.

The value of k is significant for constructing the de Bruijn graph (Luo et al., 2015). Some short redundant areas will be removed by a large value of k, thus reducing the number of nodes in the de Bruijn graph (Benoit et al., 2014), but this will induce disconnected subgraphs. A small value of k will minimize those gap areas, thus increasing the connectivity of the de Bruijn graph and adding additional nodes and increasing short recurring regions. The value of k cannot, therefore, be too big or too small.

Adjacent to the reads, which cease at k-mers from repeat borders, are used to sculpt contigs. This requires very precise reads, which initially reduces the potential for reads to solve repeats longer than k-bases. It has the advantage that it does not require pairwise overlap storage and a graph structure that represents the genome’s repeat structure. The following are the steps to be performed:

SOAPdenovo (Li et al., 2010) has been used to compile several genomes successfully but its continuity, accuracy, and coverage, in repeat regions in particular, need to be enhanced. To overcome this difficulty, SOAPdenovo2 (Luo et al., 2012) has been built to solve more repeated contiguous areas, increase coverage and length in scaffolding, boost gap closure, and optimize for the large genome. The current architecture of the SOAPdenovo2 algorithm reduces the memory consumption for graph construction (Ye et al., 2012).

SOAPdenovo2 substantially improves: (1) the algorithm for error correction, (2) memory usage in DBG constructions, (3) assembly length and scaffolding coverage, (4) the closing of gaps, and (5) resolution of longer repeat areas in contig assemblies.

SOAPdenovo2 consists of six modules, like SOAPdenovo. (1) Genome DNA is randomly fragmented using paired-end technology and sequenced. Short read with sizes between 150 and 500bp are amplified and sequenced directly, whereas long-range (2–10 kb) paired-end libraries are constructed by circularizing DNA, fragmentation, and then cleansing fragments for cluster creation with ranges of 400–600 bp. (2) To represent the overlap between the reads, the raw or corrected reads are then loaded into computer memory, and de Bruijn graph data structure is used (Li et al., 2010). (3) By eliminating erroneous ties and resolving small repeats by read paths, the graph is simplified as follows: (a) Cut-off of short tips, (b) Deletion of links with low coverage, (c) Resolution of tiny repeats with a read path, and (d) fusion of bubbles formed by repeats or heterozygotes of diploid chromosomes. (4) The links at repeat boundaries are broken on the simplified graph, and unambiguous sequence fragments are output as contigs. (5) By using paired-end information, reads are realigned to contigs, and scaffolds are generated from unique contigs. (6) Finally, using the paired-end extracted reads, intra-scaffold gaps are filled.

SPAdes (Bankevich et al., 2012) is a universal A-Bruijn assembler (Pevzner, Tang & Tesler, 2004) for single and multi-cell assembly. It performs an operation that does not directly affect the sequences; rather, it performs graph topology, coverage, and length of the sequence. It uses k-mers for the sole purpose of constructing the de Bruijn graph. It performs graph-theoretical operations on the subsequent stages exclusively in graphs that need not be labeled by k-mers (Bankevich et al., 2012). The consensus sequence of DNA is restored at the last stage.

During reconstruction, a string from the set of its k-mer is often abstracted in fragment assembly. The de Bruijn approach to assembly leads to this abstraction, which underlies several algorithms for assemblies. However, a more progressive abstraction of NGS data takes into account the problem of reforming a string from a set of pairs of k-mers (k-bimers) at a distance approximately d in a string (Bankevich et al., 2012). The study of the latter abstraction has chiefly been associated with heuristic post-processing, even though there are simple algorithms available for the former abstraction in the de Bruijn graph (Pevzner, Tang & Waterman, 2001; Zerbino & Birney, 2008; Butler et al., 2008).

By adding k-bimer modification, which concedes the exact distances for the vast majority of the adjusted K-bimers (Bankevich et al., 2012), SPAdes solves the impracticality due to differences in the biread1 (and thus k-bimer) distances and adding PDBG-inspired paired assembly graph (Medvedev et al., 2011). SPAdes may use read-pairs; in particular, E+V-SC (Chitsaz et al., 2011) has been using the reads but has skipped the read-pairs pairing to prevent misassemblies because of a high level of chimeric read-pairs.

The four stages of SPAdes that resolve problems that are especially problematic in SCS are sequencing errors; non-uniform coverage; disparities in insert size; and chimeric reads and bireads.

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol