Indexing continuous-valued seeds

DV Davide Verzotto
AT Audrey S. M. Teo
AH Axel M. Hillmer
NN Niranjan Nagarajan
request Request a Protocol
ask Ask a question
Favorite

The definition of appropriate seeds is critical in a seed-and-extend approach in order to maintain a good balance between sensitivity and speed. A direct extension of discrete-valued seeds to continuous values is to consider values that are close to each other (as defined by the Cσ bound) as matches. However, as mapping data typically have high error rates [13, 16] and represent short sequences (for example, on average, optical maps contain 10–22 fragments, representing roughly a 250 kbp region of the genome), a seed of c consecutive fragments (c-mer) is likely to have low sensitivity unless we use a naive c=1 approach (see Fig. Fig.22 for a comparison) and pay a significant runtime penalty that scales with genome size [14, 16]. Therefore, we propose and validate the following composite seed extension for continuous-valued seeds, analogous to the work on spaced seeds for discrete-valued sequences [21].

Comparison of sensitivity between different seeding approaches for the human genome. a The easier scenario (a). b The harder scenario (b). For each corresponding length in fragments, we report the percentage of maps with at least one correct seed detected (out of 100 maps). Note that the approach used in OPTIMA, Composite seeds (iv), was able to find the correct location for more than 99 and 88 % of maps with at least ten fragments in scenarios (a) and (b), respectively

Let rj1, rj2 and rj3 be consecutive restriction fragments from a reference in silico map. A continuous-valued composite seed, for c=2, is given by including all of the following:

(i) the c-mer rj1rj2, corresponding to no false cuts in the in silico map;

(ii) the c-mer rj1rj2rj3, corresponding to a missing cut in the experimental map (or false cut in the in silico map); and

(iii) the c-mer rj1rj2rj3, corresponding to a different missing cut in the experimental map (or false cut in the in silico map).

The reference index would then contain all c-tuples corresponding to a composite seed, as defined in Definition 4, for each location in the reference map. In addition, to account for false cuts in the experimental map, for each set of consecutive fragments oi1, oi2 and oi3 in the experimental maps, we search for c-tuples of the type oi1oi2 and oi1oi2oi3 in the index (see Composite seeds (iv) depicted in Fig. Fig.111cc).

To index the seeds, we adopt a straightforward approach where all c-tuples are collected and sorted into the same index in lexicographic order by c1 (where the ci are elements in the c-tuple). Lookups can be performed by binary search over fragment-sized intervals that satisfy the Cσ bound for c1 and a subsequent linear scan of the other elements ci, for i≥2, while verifying the Cσ bound in each case. Note that, because seeds are typically expected to be of higher quality, we can apply a more stringent threshold on seed fragment size matches (for example, we used CσSeed=2).

As shown in the “Results and discussion” section, this approach significantly reduces the space of candidate alignments without affecting the sensitivity of the search. A comparison between the various seeding approaches is shown in Fig. Fig.2,2, which highlights the advantages of composite seeds with respect to 2-mers.

Overall, the computational cost of finding seeds using this approach is O(m (logn+c #seedsc=1)) per experimental map, where n is the total length of the in silico maps in fragments, mn is the length of the experimental map and #seedsc=1 is the number of seeds found in the first level of the index lookup, before narrowing down the list to the actual number of seeds that will be extended (#seeds). The cost and space of creating the reference index is thus O(c n), if the number of errors considered in the composite seeds is limited and bounded (as in Definition 4), and radix sort is used to sort the index. This approach drastically reduces the number of alignments computed in comparison to more general, global alignment searches [10], as will be shown later in the “Results and discussion” section.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A