Mash distance

Brian D. Ondov; Todd J. Treangen; Páll Melsted; Adam B. Mallonee; Nicholas H. Bergman; Sergey Koren; Adam M. Phillippy

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Mash distance

BO Brian D. Ondov

TT Todd J. Treangen

PM Páll Melsted

AM Adam B. Mallonee

NB Nicholas H. Bergman

SK Sergey Koren

AP Adam M. Phillippy

This method is extracted from research article: Genome Biol, Jun 2016

Mash: fast genome and metagenome distance estimation using MinHash

DOI: 10.1186/s13059-016-0997-x

Request a Protocol

Ask a question

Favorite

A MinHash sketch of size s = 1 is equivalent to the subsequent “minimizer” concept of Roberts et al. [42], which has been used in genome assembly [43], k-mer counting [44], and metagenomics [45]. Importantly, the more general MinHash concept permits an approximation of the Jaccard index $J (A, B) = \frac{|A \cap B|}{|A \cup B|}$ between two k-mer sets A and B. Mash follows Broder’s original formulation and merge-sorts two bottom sketches S(A) and S(B) to estimate the Jaccard index [4]. The merge is terminated after s unique hashes have been processed (or both sketches exhausted), and the Jaccard estimate is computed as $j = \frac{x}{s'}$ for x shared hashes found after processing s’ hashes. Because the sketches are stored in sorted order, this requires only O(s) time and effectively computes:

which is an unbiased estimate of the true Jaccard index, as illustrated in Fig. 1. Conveniently, the error bound of the Jaccard estimate $ε = O (\frac{1}{\sqrt{s}})$ relies only on the sketch size and is independent of genome size [46]. Specific confidence bounds are given below and in Additional file 1: Figure S1. Note, however, that the relative error can grow quite large for very small Jaccard values (i.e. divergent genomes). In these cases, a larger sketch size or smaller k is needed to compensate. For flexibility, Mash can also compare sketches of different size, but such comparisons are constrained by the smaller of the two sketches s < u and only the s smallest values are considered.

The Jaccard index is a useful measure of global sequence similarity because it correlates with ANI, a common measure of global sequence similarity. However, like the MUM index [19], J is sensitive to genome size and simultaneously captures both point mutations and gene content differences. For distance-based applications, the Jaccard index can be converted to the Jaccard distance J_δ(A, B) = 1 − J(A, B), which is related to the q-gram distance but without occurrence counts [47]. This can be a useful metric for clustering, but is non-linear in terms of the sequence mutation rate. In contrast, the Mash distance D seeks to directly estimate a mutation rate under a simple Poisson process of random site mutation. As noted by Fan et al. [22], given the probability d of a single substitution, the expected number of mutations in a k-mer is λ = kd. Thus, under a Poisson model (assuming unique k-mers and random, independent mutation), the probability that no mutation will occur in a given k-mer is e^−kd, with an expected value equal to the fraction of conserved k-mers w to the total number of k-mers t in the genome, $\frac{w}{t}$ . Solving $e^{- k d} = \frac{w}{t}$ gives $d = - \frac{1}{k} ln \frac{w}{t}$ . To account for two genomes of different sizes, Fan et al. [22] set t to the smaller of the two genome’s k-mer counts, thereby measuring containment of the k-mer set. In contrast, Mash sets t to the average genome size n, thereby penalizing for genome size differences and measuring resemblance (e.g. to avoid a distance of zero between a phage and a genome containing that phage). Finally, because the Jaccard estimate j can be framed in terms of the average genome size $j = \frac{w}{2 n - w}$ , the fraction of shared k-mers can be framed in terms of the Jaccard index $\frac{w}{n} = \frac{2 j}{1 + j}$ , yielding the Mash distance:

Equation 4 carries many assumptions and does not attempt to model more complex evolutionary processes, but closely approximates the divergence of real genomes (Fig. 2). With appropriate choices of s and k, it can be used as a replacement for costly ANI computations. Table 1 and Additional file 1: Figure S2 give error bounds on the Mash distance for various sketch sizes and Additional file 1: Figure S3 illustrates the relationship between the Jaccard index, Mash distance, k-mer size, and genome size.

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol