A MinHash sketch of size s = 1 is equivalent to the subsequent “minimizer” concept of Roberts et al. [42], which has been used in genome assembly [43], k-mer counting [44], and metagenomics [45]. Importantly, the more general MinHash concept permits an approximation of the Jaccard index between two k-mer sets A and B. Mash follows Broder’s original formulation and merge-sorts two bottom sketches S(A) and S(B) to estimate the Jaccard index [4]. The merge is terminated after s unique hashes have been processed (or both sketches exhausted), and the Jaccard estimate is computed as for x shared hashes found after processing s’ hashes. Because the sketches are stored in sorted order, this requires only O(s) time and effectively computes:
which is an unbiased estimate of the true Jaccard index, as illustrated in Fig. 1. Conveniently, the error bound of the Jaccard estimate relies only on the sketch size and is independent of genome size [46]. Specific confidence bounds are given below and in Additional file 1: Figure S1. Note, however, that the relative error can grow quite large for very small Jaccard values (i.e. divergent genomes). In these cases, a larger sketch size or smaller k is needed to compensate. For flexibility, Mash can also compare sketches of different size, but such comparisons are constrained by the smaller of the two sketches s < u and only the s smallest values are considered.
The Jaccard index is a useful measure of global sequence similarity because it correlates with ANI, a common measure of global sequence similarity. However, like the MUM index [19], J is sensitive to genome size and simultaneously captures both point mutations and gene content differences. For distance-based applications, the Jaccard index can be converted to the Jaccard distance Jδ(A, B) = 1 − J(A, B), which is related to the q-gram distance but without occurrence counts [47]. This can be a useful metric for clustering, but is non-linear in terms of the sequence mutation rate. In contrast, the Mash distance D seeks to directly estimate a mutation rate under a simple Poisson process of random site mutation. As noted by Fan et al. [22], given the probability d of a single substitution, the expected number of mutations in a k-mer is λ = kd. Thus, under a Poisson model (assuming unique k-mers and random, independent mutation), the probability that no mutation will occur in a given k-mer is e−kd, with an expected value equal to the fraction of conserved k-mers w to the total number of k-mers t in the genome, . Solving gives . To account for two genomes of different sizes, Fan et al. [22] set t to the smaller of the two genome’s k-mer counts, thereby measuring containment of the k-mer set. In contrast, Mash sets t to the average genome size n, thereby penalizing for genome size differences and measuring resemblance (e.g. to avoid a distance of zero between a phage and a genome containing that phage). Finally, because the Jaccard estimate j can be framed in terms of the average genome size , the fraction of shared k-mers can be framed in terms of the Jaccard index , yielding the Mash distance:
Equation 4 carries many assumptions and does not attempt to model more complex evolutionary processes, but closely approximates the divergence of real genomes (Fig. 2). With appropriate choices of s and k, it can be used as a replacement for costly ANI computations. Table 1 and Additional file 1: Figure S2 give error bounds on the Mash distance for various sketch sizes and Additional file 1: Figure S3 illustrates the relationship between the Jaccard index, Mash distance, k-mer size, and genome size.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.