To construct a MinHash sketch, Mash first determines the set of constituent k-mers by sliding a window of length k across the sequence. Mash supports arbitrary alphabets (e.g. nucleotide or amino acid) and both assembled and unassembled sequences. Without loss of generality, here we will assume a nucleotide alphabet Σ = {A,C,G,T}. Depending on the alphabet size and choice of k, each k-mer is hashed to either a 32-bit or 64-bit value via a hash function, h. For nucleotide sequences, Mash uses canonical k-mers by default to allow strand-neutral comparisons. In this case, only the lexicographically smaller of the forward and reverse complement representations of a k-mer is hashed. For a given sketch size s, Mash returns the s smallest hashes output by h over all k-mers in the sequence (Fig. 1). Such sketches are typically referred to as "bottom-k sketches", where k conventionally denotes the sketch size; we refer to them simply as "bottom sketches" to avoid confusion with the k-mer size k. For a sketch size s and genome size n, a bottom sketch can be efficiently computed in O(n log s) time by maintaining a sorted list of size s and updating the current sketch only when a new hash is smaller than the current sketch maximum. Further, the probability that the i-th hash of the genome will enter the sketch is s/i, so the expected runtime of the algorithm is O(n + s log s log n) [4], which becomes nearly linear when n ≫ s.
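For illustration, a minimal Python sketch of this construction follows, using a heap as the sorted list. The hash function here (BLAKE2b truncated to 64 bits) and all names are illustrative stand-ins; Mash itself uses MurmurHash3.

```python
import hashlib
import heapq

_COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMP)[::-1]
    return min(kmer, rc)

def bottom_sketch(seq: str, k: int = 21, s: int = 1000) -> list[int]:
    """Return the s smallest distinct 64-bit hashes over all canonical k-mers."""
    heap = []        # max-heap via negation: -heap[0] is the current sketch max
    members = set()  # hashes currently in the sketch, to keep entries distinct
    for i in range(len(seq) - k + 1):
        kmer = canonical(seq[i:i + k])
        h = int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")
        if h in members:
            continue
        if len(heap) < s:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:  # smaller than the sketch max: evict max, insert h
            members.discard(-heapq.heappushpop(heap, -h))
            members.add(h)
    return sorted(-x for x in heap)
```

Hashes larger than the sketch max cost only a comparison, so the heap realizes the O(n log s) update bound directly.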
As demonstrated by Fig. 3, a sketch comprising 400 32-bit hash values is sufficient to roughly group microbial genomes by species. With these parameters, the resulting sketch size equals 1.6 kB for each genome. For large genomes, this represents an enormous lossy compression (e.g. compared to the 750 MB needed to store a 3 Gbp genome using 2-bit encoding). However, the probability of a given k-mer K appearing in a random genome X of size n is:

$$P(K \in X) = 1 - \left(1 - |\Sigma|^{-k}\right)^{n}$$
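As a quick numeric check of this expression (a toy sketch; the function name is ours, and the genome is modeled as independent random bases):

```python
def p_random_kmer(k: int, n: int, sigma: int = 4) -> float:
    """P(K in X): chance that a fixed k-mer occurs in a random genome of n bases."""
    return 1.0 - (1.0 - sigma ** -k) ** n

# ~0.50 for k = 16 in a 3 Gbp genome, as quoted below; squaring gives the
# ~25 % of k-mers expected to be shared between two random 3 Gbp genomes.
print(p_random_kmer(16, 3_000_000_000))
```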
Thus, for k = 16 the probability of observing a given k-mer in a 3 Gbp genome is 0.50, and 25 % of k-mers are expected to be shared between two random 3 Gbp genomes by chance alone. This will skew any k-mer based distance and make distantly related genomes appear more similar than they truly are. To avoid this phenomenon, it is sufficient to choose a value of k large enough that the probability of observing a random k-mer is small. Given a known genome size n and the desired probability q of observing a random k-mer (e.g. 0.01), this can be computed as [41]:

$$k = \log_{|\Sigma|}\!\left(\frac{n\,(1-q)}{q}\right)$$
which yields k = 14 and k = 19 (rounded down) for 5 Mbp and 3 Gbp genomes (q = 0.01), respectively. We have found that the parameters k = 21 and s = 1000 give accurate estimates in most cases (including metagenomes), so these are set as the defaults and still require just 8 kB per sketch. However, for constructing the RefSeq database, k = 16 was chosen so that each hash could fit in 32 bits, minimizing the database size at the expense of reduced specificity for larger genomes. The smaller k also improves sensitivity, which helps when comparing noisy data such as single-molecule sequencing reads (Additional file 1: Figures S2 and S3).
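The calculation is straightforward to reproduce (a toy sketch; the function name is ours):

```python
import math

def random_kmer_k(n: int, q: float, sigma: int = 4) -> float:
    """k at which a random k-mer appears in an n bp genome with probability ~q."""
    return math.log(n * (1 - q) / q, sigma)

# ~14.4 for a 5 Mbp genome and ~19.1 for 3 Gbp at q = 0.01,
# matching the k = 14 and k = 19 quoted above after rounding down.
print(random_kmer_k(5_000_000, 0.01), random_kmer_k(3_000_000_000, 0.01))
```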
Lastly, for sketching raw sequencing reads, Mash provides two strategies to remove erroneous k-mers: a two-stage MinHash strategy and a Bloom filter strategy. Both assume that redundancy in the data (e.g. depth of coverage >5) will cause true k-mers to appear multiple times in the input, while false k-mers will appear only a few times. Given a coverage threshold c, Mash can optionally ignore low-abundance k-mers with counts less than c. By default, the coverage threshold is set to one and all k-mers are considered for the sketch. Increasing this threshold enables the two-stage MinHash filter strategy, which tracks both the k-mer hashes in the current sketch and a secondary set of candidate hashes. At any time, the current sketch contains the s smallest hashes of all k-mers that have been observed at least c times, and the candidate set contains hashes that are smaller than the largest value in the sketch (the sketch max) but have been observed fewer than c times. When processing new k-mers, those with a hash greater than the sketch max are immediately discarded, as usual. However, if a new hash is smaller than the current sketch max, it is checked against the candidate set. If absent, it is added to this set. If present with a count less than c − 1, its counter is incremented. If present with a count of c − 1 or greater, it is removed from the candidate set and added to the sketch. At this point the sketch max has changed, and the candidate set can be pruned to contain only values less than the new sketch max. The result of this online method is equivalent to running the MinHash algorithm on only those k-mers that occur c or more times in the input. However, in the worst case, if all k-mers in the input occur fewer than c times, no hashes would escape the candidate set and memory use would grow with each new k-mer processed.
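Below is a minimal Python sketch of this two-stage filter as we read the description above; the names and the heap/dictionary bookkeeping are ours, not Mash's internals. The input `hashes` is any stream of k-mer hash values, e.g. produced as in the sketching loop earlier.

```python
import heapq

def filtered_bottom_sketch(hashes, s: int, c: int) -> list[int]:
    """Bottom-s sketch over only those hashes occurring at least c times."""
    sketch = []      # max-heap via negation: -sketch[0] is the sketch max
    in_sketch = set()
    candidates = {}  # hash -> times seen, for hashes below sketch max seen < c times
    for h in hashes:
        if h in in_sketch:
            continue  # already in the sketch; nothing to update
        if len(sketch) == s and h >= -sketch[0]:
            continue  # greater than the sketch max: discard immediately
        count = candidates.get(h, 0) + 1
        if count < c:
            candidates[h] = count  # still unproven: keep (or bump) its count
            continue
        candidates.pop(h, None)    # observed c times: promote into the sketch
        heapq.heappush(sketch, -h)
        in_sketch.add(h)
        if len(sketch) > s:
            in_sketch.discard(-heapq.heappop(sketch))
            new_max = -sketch[0]   # sketch max shrank: prune stale candidates
            candidates = {x: n for x, n in candidates.items() if x < new_max}
    return sorted(-x for x in sketch)
```

Note that with c = 1 every hash is promoted immediately and the method reduces to the plain bottom sketch, and that the final pruning step keeps the candidate dictionary small in the typical case; as noted above, if no k-mer ever reaches c occurrences, the dictionary grows without bound.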
Alternatively, a Bloom filter can be used to probabilistically exclude single-copy k-mers using a fixed amount of memory. In this approach, a Bloom filter is maintained instead of a candidate list, and new hashes are inserted into the sketch only if they are less than the sketch max and found in the Bloom filter. If a new hash would otherwise have been inserted into the sketch but was not found in the Bloom filter, it is inserted into the Bloom filter so that subsequent appearances of the hash will pass. This effectively excludes many single-copy k-mers from the sketch, but does not guarantee that all will be filtered. With this approach, filtering at copy-number thresholds greater than two would also be possible using a counting Bloom filter, but this has not been implemented because the exact method typically outperforms the Bloom method in practice, in terms of both accuracy and memory usage.
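A sketch of this variant follows, with a toy Bloom filter standing in for a production implementation; the filter size and hash count are illustrative assumptions, not Mash's settings.

```python
import hashlib
import heapq

class BloomFilter:
    """Toy Bloom filter; m and n_hashes are illustrative, not tuned values."""
    def __init__(self, m: int = 1 << 20, n_hashes: int = 4):
        self.bits = bytearray(m // 8)
        self.m, self.n_hashes = m, n_hashes

    def _positions(self, h: int):
        for i in range(self.n_hashes):  # derive n_hashes bit positions from h
            d = hashlib.blake2b(h.to_bytes(8, "big"), digest_size=8,
                                salt=bytes([i])).digest()
            yield int.from_bytes(d, "big") % self.m

    def add(self, h: int) -> None:
        for p in self._positions(h):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, h: int) -> bool:
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(h))

def bloom_filtered_sketch(hashes, s: int) -> list[int]:
    """Bottom-s sketch that probabilistically drops single-copy hashes."""
    seen_once = BloomFilter()
    sketch, in_sketch = [], set()
    for h in hashes:
        if h in in_sketch:
            continue
        if len(sketch) == s and h >= -sketch[0]:
            continue                # above the sketch max: discard as usual
        if h not in seen_once:
            seen_once.add(h)        # first sighting: remember, but do not sketch
            continue
        heapq.heappush(sketch, -h)  # seen before: admit to the sketch
        in_sketch.add(h)
        if len(sketch) > s:
            in_sketch.discard(-heapq.heappop(sketch))
    return sorted(-x for x in sketch)
```

Because Bloom membership tests can collide, a false positive occasionally admits a single-copy k-mer, which is the probabilistic trade-off noted above.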