Indexing continuous-valued seeds

Davide Verzotto; Audrey S. M. Teo; Axel M. Hillmer; Niranjan Nagarajan



This method is extracted from research article: Gigascience, Jan 2016

OPTIMA: sensitive and accurate whole-genome alignment of error-prone genomic maps by combinatorial indexing and technology-agnostic statistical analysis

DOI: 10.1186/s13742-016-0110-0


The definition of appropriate seeds is critical in a seed-and-extend approach in order to maintain a good balance between sensitivity and speed. A direct extension of discrete-valued seeds to continuous values is to consider values that are close to each other (as defined by the C_σ bound) as matches. However, mapping data typically have high error rates [13, 16] and represent short sequences (for example, optical maps contain 10–22 fragments on average, representing roughly a 250 kbp region of the genome). A seed of c consecutive fragments (c-mer) is therefore likely to have low sensitivity unless we use a naive c=1 approach (see Fig. 2 for a comparison) and pay a significant runtime penalty that scales with genome size [14, 16]. We therefore propose and validate the following composite seed extension for continuous-valued seeds, analogous to the work on spaced seeds for discrete-valued sequences [21].

Fig. 2 Comparison of sensitivity between different seeding approaches for the human genome. (a) The easier scenario. (b) The harder scenario. For each corresponding length in fragments, we report the percentage of maps with at least one correct seed detected (out of 100 maps). Note that the approach used in OPTIMA, Composite seeds (iv), was able to find the correct location for more than 99% and 88% of maps with at least ten fragments in scenarios (a) and (b), respectively.

Let r_j₁, r_j₂ and r_j₃ be consecutive restriction fragments from a reference in silico map. A continuous-valued composite seed, for c=2, is given by including all of the following:

(i) the c-mer r_j₁, r_j₂, corresponding to no false cuts in the in silico map;

(ii) the c-mer r_j₁ + r_j₂, r_j₃, corresponding to a missing cut in the experimental map (or false cut in the in silico map); and

(iii) the c-mer r_j₁, r_j₂ + r_j₃, corresponding to a different missing cut in the experimental map (or false cut in the in silico map).
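The three cases above can be enumerated mechanically for each position of a reference map. The following is a minimal Python sketch, not the OPTIMA implementation: the function name `composite_seed_tuples`, the list-of-floats map representation, and the choice to emit only case (i) when fewer than three fragments remain are all assumptions made for illustration.

```python
def composite_seed_tuples(ref, j):
    """Return the c=2 composite-seed tuples anchored at fragment j.

    `ref` is a list of fragment sizes from one in silico map
    (hypothetical representation). Requires j + 1 < len(ref).
    """
    r1, r2 = ref[j], ref[j + 1]
    tuples = [(r1, r2)]  # (i) no false cuts in the in silico map
    if j + 2 < len(ref):  # assumption: skip merged cases at the map end
        r3 = ref[j + 2]
        tuples.append((r1 + r2, r3))  # (ii) cut between r1 and r2 missing
        tuples.append((r1, r2 + r3))  # (iii) cut between r2 and r3 missing
    return tuples


ref = [10.0, 4.5, 7.0, 12.0]  # toy fragment sizes, e.g. in kbp
seeds = composite_seed_tuples(ref, 0)
print(seeds)
```

For the toy map, position 0 yields the plain 2-mer plus the two merged variants, so each reference position contributes at most three tuples to the index.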

The reference index would then contain all c-tuples corresponding to a composite seed, as defined in Definition 4, for each location in the reference map. In addition, to account for false cuts in the experimental map, for each set of consecutive fragments o_i₁, o_i₂ and o_i₃ in the experimental maps, we search for c-tuples of the type o_i₁, o_i₂ and o_i₁ + o_i₂, o_i₃ in the index (see Composite seeds (iv) depicted in Fig. 1c).
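On the query side, the two tuple types searched for each window of an experimental map can be generated in the same spirit. This is an illustrative sketch only: `query_tuples` and the list-of-floats representation are hypothetical names, and emitting only the plain pair when no third fragment is available is an assumption.

```python
def query_tuples(exp_map, i):
    """Return the c=2 tuples to look up for the window starting at fragment i.

    `exp_map` is a list of fragment sizes from one experimental map
    (hypothetical representation). Requires i + 1 < len(exp_map).
    """
    o1, o2 = exp_map[i], exp_map[i + 1]
    tuples = [(o1, o2)]  # no false cut assumed in this window
    if i + 2 < len(exp_map):
        # Merge o1 and o2, treating the cut between them as a false cut
        # in the experimental map.
        tuples.append((o1 + o2, exp_map[i + 2]))
    return tuples


exp_map = [6.0, 4.0, 4.5, 7.5]  # toy fragment sizes
queries = query_tuples(exp_map, 0)
print(queries)
```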

To index the seeds, we adopt a straightforward approach where all c-tuples are collected and sorted into the same index in lexicographic order by c₁ (where the c_i are elements in the c-tuple). Lookups can be performed by binary search over fragment-sized intervals that satisfy the C_σ bound for c₁ and a subsequent linear scan of the other elements c_i, for i≥2, while verifying the C_σ bound in each case. Note that, because seeds are typically expected to be of higher quality, we can apply a more stringent threshold on seed fragment size matches (for example, we used $C_{σ}^{Seed} = 2$).
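A sorted index with a binary search on c₁ followed by a linear scan on c₂ might look like the sketch below for c=2. Several simplifications are assumed here and are not taken from the article: the C_σ bound is replaced by a fixed absolute tolerance `tol`, the index is a plain sorted Python list, and the names `build_index` and `lookup` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(ref_maps):
    """Collect all c=2 composite-seed tuples and sort them by c1.

    Each entry is (c1, c2, map_id, position).
    """
    entries = []
    for map_id, ref in enumerate(ref_maps):
        for j in range(len(ref) - 1):
            entries.append((ref[j], ref[j + 1], map_id, j))
            if j + 2 < len(ref):
                entries.append((ref[j] + ref[j + 1], ref[j + 2], map_id, j))
                entries.append((ref[j], ref[j + 1] + ref[j + 2], map_id, j))
    entries.sort()  # tuple ordering sorts by c1 first
    keys = [e[0] for e in entries]  # c1 values, used for binary search
    return keys, entries


def lookup(keys, entries, query, tol):
    """Find entries whose c1 and c2 both lie within `tol` of the query.

    `tol` is a stand-in for the C_sigma bound (assumption).
    """
    q1, q2 = query
    lo = bisect_left(keys, q1 - tol)   # first c1 >= q1 - tol
    hi = bisect_right(keys, q1 + tol)  # one past last c1 <= q1 + tol
    # Linear scan of the candidates, checking the second element.
    return [e for e in entries[lo:hi] if abs(e[1] - q2) <= tol]


keys, entries = build_index([[10.0, 4.5, 7.0, 12.0]])
hits = lookup(keys, entries, (14.3, 7.2), tol=0.5)
print(hits)
```

In the toy query, the pair (14.3, 7.2) only matches the merged reference tuple (10.0 + 4.5, 7.0), which is the situation the composite seed is designed to recover: a cut present in the reference but missing from the experimental map.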

As shown in the “Results and discussion” section, this approach significantly reduces the space of candidate alignments without affecting the sensitivity of the search. A comparison between the various seeding approaches is shown in Fig. 2, which highlights the advantages of composite seeds with respect to 2-mers.

Overall, the computational cost of finding seeds using this approach is O(m (log n + c · #seeds_{c=1})) per experimental map, where n is the total length of the in silico maps in fragments, m ≪ n is the length of the experimental map, and #seeds_{c=1} is the number of seeds found in the first level of the index lookup, before narrowing down the list to the actual number of seeds that will be extended (#seeds). The cost and space of creating the reference index are thus O(c n), if the number of errors considered in the composite seeds is limited and bounded (as in Definition 4), and radix sort is used to sort the index. This approach drastically reduces the number of alignments computed in comparison to more general, global alignment searches [10], as will be shown later in the “Results and discussion” section.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
