String graph-based method

Firaol Dida; Gangman Yi

This method is extracted from research article: PeerJ Comput Sci, Jul 2021

Empirical evaluation of methods for de novo genome assembly

DOI: 10.7717/peerj-cs.636

The string graph is a simplified version of the classic overlap graph, built from the sequenced reads and their suffix-to-prefix overlaps and keeping only the non-transitive edges (Liao et al., 2019). It is an essential data representation used by overlap-layout-consensus (OLC) assemblers. The vertices in a string graph are the input reads, the arcs correspond to overlaps between reads, and contigs are read off as paths in the graph. For long-read assembly, an overlap-based approach is straightforward because it assembles the long reads without decomposing them into k-mers.
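As a minimal sketch of the suffix-to-prefix overlaps described above, the following Python computes exact overlaps between every ordered pair of reads. The function names, the toy reads, and the `min_len` threshold are illustrative assumptions; real assemblers tolerate mismatches and use indexes rather than this quadratic comparison.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no exact overlap of at least `min_len` bases exists.
    (Hypothetical helper: exact matching only, no sequencing errors.)
    """
    max_len = min(len(a), len(b))
    for length in range(max_len, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_len=3):
    """Map each ordered pair of read indices (i, j) to its overlap length."""
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                olen = suffix_prefix_overlap(a, b, min_len)
                if olen:
                    edges[(i, j)] = olen
    return edges


# Toy reads (made up for illustration).
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
edges = build_overlap_graph(reads)
print(edges)  # {(0, 1): 4, (1, 2): 4}
```

Here the vertices are the read indices and each arc records how many bases the two reads share, so the path 0 -> 1 -> 2 spells out a single contig.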

The formulation of string graph assembly is similar in principle to that of a de Bruijn graph. However, it has the advantage of using the complete length of each read sequence rather than decomposing sequences into k-mers (Liao et al., 2019). The string graph can be extracted from the overlap graph by first removing duplicate reads and contained reads, and then discarding transitive edges from the graph.
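The extraction steps above can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the procedure of any particular assembler: containment is tested as exact substring matching, the graph is assumed to be acyclic, and the transitive reduction is the naive cubic check rather than a linear-time algorithm. All function names are hypothetical.

```python
def remove_duplicates_and_contained(reads):
    """Drop exact duplicates, then reads that are substrings of another read."""
    unique = list(dict.fromkeys(reads))  # keeps first occurrence, in order
    return [r for r in unique
            if not any(r != other and r in other for other in unique)]


def transitive_reduction(edges):
    """Remove every edge u->w for which some v gives u->v and v->w.

    `edges` maps a node to the set of its successors. Assumes an acyclic
    graph; overlap lengths are ignored in this sketch.
    """
    reduced = {u: set(vs) for u, vs in edges.items()}
    for u, succs in edges.items():
        for v in succs:
            for w in edges.get(v, ()):
                if w != v and w in succs:
                    reduced[u].discard(w)
    return reduced


# Toy input: one duplicate read and one contained read ("GTAC").
reads = ["ACGTAC", "GTACGG", "ACGGTT", "GTAC", "ACGTAC"]
kept = remove_duplicates_and_contained(reads)
print(kept)  # ['ACGTAC', 'GTACGG', 'ACGGTT']

# A->C is implied by A->B and B->C, so it is discarded.
overlap_graph = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
string_graph = transitive_reduction(overlap_graph)
print(string_graph)  # {'A': {'B'}, 'B': {'C'}, 'C': set()}
```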

For long sequences and single-molecule sequencing reads with a high error rate, overlap-based approaches are more suitable than de Bruijn graph-based methods.

SGA (Simpson & Durbin, 2012) is an assembler that combines assembly algorithms with memory-efficient data structures, chiefly the FM-index (Ferragina & Manzini, 2005), which is derived from the compressed Burrows–Wheeler transform (Burrows & Wheeler, 1994). In contrast to most de novo assemblers, which depend on de Bruijn graphs, SGA uses an overlap-based string graph model, which can easily be parallelized.

Because de novo assembly usually demands queries over the entire set of sequences, large datasets tend to be a practical problem for developers and users of assembly software. Compressed data structures exploit the redundancy contained in the sequences to reduce the memory needed to perform de novo assemblies.
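To make the FM-index idea concrete, here is a small, uncompressed Python sketch of the Burrows–Wheeler transform and the backward search that counts occurrences of a pattern. It is not SGA's implementation: the class and method names are assumptions, the occurrence table is stored in full rather than sampled or compressed, and the transform is built by sorting rotations, which only works for short toy texts.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations; `text` must end in '$'."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)


class FMIndex:
    """Toy FM-index supporting occurrence counting (illustrative only)."""

    def __init__(self, text):
        assert "$" not in text, "'$' is reserved as the terminator"
        self.bwt = bwt(text + "$")
        alphabet = sorted(set(self.bwt))
        # C[c]: number of characters in the text strictly smaller than c.
        self.C = {}
        total = 0
        for c in alphabet:
            self.C[c] = total
            total += self.bwt.count(c)
        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ = {c: [0] * (len(self.bwt) + 1) for c in alphabet}
        for i, ch in enumerate(self.bwt):
            for c in alphabet:
                self.occ[c][i + 1] = self.occ[c][i] + (ch == c)

    def count(self, pattern):
        """Count occurrences of `pattern` by backward search."""
        lo, hi = 0, len(self.bwt)  # half-open interval of matching rows
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


index = FMIndex("ACGTACGTAC")
print(index.count("ACG"))  # 2
print(index.count("AC"))   # 3
print(index.count("GG"))   # 0
```

Each query touches the index once per pattern character, independent of the text length, which is what makes whole-dataset queries affordable once the index is built.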

The SGA algorithms are based on querying an FM-index constructed from a set of sequence reads. The SGA pipeline starts by preprocessing the sequence reads to filter or trim reads with multiple low-quality or ambiguous base calls (Simpson & Durbin, 2012). The FM-index is constructed from the filtered set of reads, and base-calling errors are detected and corrected using k-mer frequencies. The corrected reads are re-indexed, duplicate sequences are discarded, the remaining low-quality sequences are filtered out, and a string graph is generated. Contigs are assembled from the string graph and, if paired-end or mate-pair data is available, built into scaffolds.
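The k-mer frequency step above can be illustrated with a short Python sketch. It flags read positions that are not covered by any sufficiently frequent k-mer, which is one simple way to locate likely base-calling errors. The functions, the threshold `min_count`, and the toy reads are assumptions made for illustration; SGA itself obtains k-mer counts from the FM-index rather than from a hash table.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_positions(read, counts, k, min_count):
    """Positions in `read` not covered by any k-mer seen >= min_count times."""
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] >= min_count:
            for j in range(i, i + k):
                covered[j] = True
    return [i for i, ok in enumerate(covered) if not ok]


# Three copies of the true sequence and one read with an error at position 4.
reads = ["ACGTACGT"] * 3 + ["ACGTTCGT"]
counts = kmer_counts(reads, k=3)
print(weak_positions("ACGTTCGT", counts, k=3, min_count=2))  # [4]
print(weak_positions("ACGTACGT", counts, k=3, min_count=2))  # []
```

A corrector would then try alternative bases at the flagged positions and keep a substitution that makes all overlapping k-mers frequent.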

To the best of our knowledge, SGA is the first assembler shown to be practical for a mammalian-sized genome on a low-end computing cluster, owing to its low memory requirements and its parallelization without inter-process communication.

FALCON is a diploid-aware long-read assembler designed to assemble haplotype contigs that represent the diploid genome with correctly phased homologous chromosomes (Chin et al., 2016). It also preserves ambiguity in the assembly graph and outputs the longest path through the graph along with alternate paths (Liao et al., 2019; Koren & Phillippy, 2015).

The FALCON assembler follows the design of the hierarchical genome assembly process (HGAP) (Chin et al., 2013) but uses components that are more computationally optimized. It begins by using reads to create a string graph containing sets of ‘haplotype-fused’ contigs and bubbles representing divergent regions between homologous sequences. Next, FALCON-Unzip identifies read haplotypes using phase data from the heterozygous positions that it identifies. The phased reads are then used, together with phased single-nucleotide polymorphisms and structural variants, to assemble the haplotype contigs and primary contigs that form the final diploid assembly.

Compared with alternative short- or long-read approaches, the FALCON-based assemblies are significantly more contiguous and complete. The phased diploid assembly enabled analysis of haplotype structure and of heterozygosity between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.

Hifiasm is a modern de novo assembler that uses long high-fidelity (HiFi) sequence reads to faithfully represent haplotype information in a phased assembly graph (Cheng et al., 2021). Unlike other graph-based assemblers, which seek to maintain the contiguity of only one haplotype, hifiasm aims to maintain the contiguity of all haplotypes. This feature enables a graph trio binning algorithm that is superior to regular trio binning.

Hifiasm corrects sequence errors while retaining heterozygous alleles by using haplotype-aware error correction, and then builds a phased assembly graph from the corrected reads using local phasing information. In the phased assembly graph, only reads from the same haplotype are linked. Using complementary data that provides global phasing information, hifiasm produces a fully phased assembly for each haplotype from the graph. With HiFi reads alone, hifiasm can also produce an unphased primary assembly. This unphased primary assembly represents the phase blocks (regions) that can be resolved with HiFi reads, but it does not maintain phase information between two phase blocks.

Hifiasm’s first few steps are broadly similar to the workflow of early long-read assemblers. Hifiasm performs an all-vs-all overlap alignment and then corrects sequencing errors. Given a target read to correct, hifiasm inspects the alignment of the reads that overlap it. A position on the target read is said to be informative if two types of A/C/G/T bases occur at that position in the alignment and each type is supported by at least three reads. If there are informative positions in the overlap and a read differs from the target read at any of these positions, the read is inconsistent with the target. Hifiasm uses only consistent reads to correct the target read.
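The selection rule described above can be sketched in Python. This is a toy model, not hifiasm's code: overlapping reads are assumed to be already aligned to the target and written in the target's coordinates, with `.` marking positions outside the overlap and `-` marking gaps (both ignored). Whether the target read itself counts toward the support of a base is not stated in the text; this sketch assumes it does. Function names are hypothetical.

```python
from collections import Counter


def informative_positions(target, aligned, min_support=3):
    """Positions with exactly two base types, each seen >= min_support times.

    The target's own base is counted as one supporting read (an assumption).
    """
    positions = []
    for pos in range(len(target)):
        column = Counter(target[pos])
        column.update(r[pos] for r in aligned if r[pos] in "ACGT")
        if len(column) == 2 and all(n >= min_support for n in column.values()):
            positions.append(pos)
    return positions


def consistent_reads(target, aligned, min_support=3):
    """Keep reads that match the target at every informative position they cover."""
    info = informative_positions(target, aligned, min_support)
    keep = []
    for r in aligned:
        covered = [p for p in info if r[p] in "ACGT"]
        if all(r[p] == target[p] for p in covered):
            keep.append(r)
    return keep


# Toy pileup: position 2 separates two haplotypes (G vs. C).
target = "ACGTA"
aligned = ["ACGTA", "ACGTA", "ACCTA", "ACCTA", "ACCTA", "ACGT."]
info = informative_positions(target, aligned)
kept = consistent_reads(target, aligned)
print(info)  # [2]
print(kept)  # ['ACGTA', 'ACGTA', 'ACGT.']
```

Only the kept reads would contribute to correcting the target, so reads from the other haplotype cannot overwrite a true heterozygous allele.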

By default, hifiasm performs three rounds of error correction. It then performs overlap alignment once more and constructs a string graph in which a vertex represents an oriented read and an edge represents a consistent overlap. After transitive reduction, a pair of heterozygous alleles is represented by a bubble in the string graph. No information is lost. If no additional data is available, hifiasm chooses one side of each bubble at random and produces a primary assembly, similar to FALCON-Unzip (Chin et al., 2016) and HiCanu (Nurk et al., 2020).
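The bubble structure and the random choice of one side can be illustrated with a small Python sketch. It is a deliberate simplification and not how hifiasm is implemented: the graph is a plain acyclic adjacency list without read orientation, and only the simplest bubbles (two branches of one vertex each) are detected. The vertex labels and function names are hypothetical.

```python
import random


def find_simple_bubbles(graph):
    """Return (source, [side_a, side_b], sink) for two-branch, one-vertex bubbles."""
    indegree = {}
    for succs in graph.values():
        for v in succs:
            indegree[v] = indegree.get(v, 0) + 1
    bubbles = []
    for source, succs in graph.items():
        if len(succs) != 2:
            continue
        a, b = succs
        if (indegree.get(a) == 1 and indegree.get(b) == 1
                and len(graph.get(a, [])) == 1
                and graph.get(a) == graph.get(b)):
            bubbles.append((source, [a, b], graph[a][0]))
    return bubbles


def primary_path(graph, start, rng):
    """Walk from `start`, picking one successor at random whenever the path branches.

    Assumes the graph is acyclic, otherwise the walk would not terminate.
    """
    path = [start]
    node = start
    while graph.get(node):
        succs = graph[node]
        node = rng.choice(succs) if len(succs) > 1 else succs[0]
        path.append(node)
    return path


# Toy string graph: r2a and r2b carry the two alleles of a heterozygous site.
graph = {"r1": ["r2a", "r2b"], "r2a": ["r3"], "r2b": ["r3"], "r3": ["r4"], "r4": []}
bubbles = find_simple_bubbles(graph)
print(bubbles)  # [('r1', ['r2a', 'r2b'], 'r3')]

path = primary_path(graph, "r1", random.Random(0))
print(path)  # r1, then either r2a or r2b, then r3, r4
```

The unchosen side of the bubble is the alternate allele, which is why resolving bubbles with additional phasing data, rather than at random, yields haplotype-resolved assemblies.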

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
