The snippy pipeline v3.1 was used for variant calling. Briefly, reads were mapped against H37Rv with the BWA-MEM algorithm (v0.7.15-r1110), with split hits marked as secondary; variants were then called with SAMtools v1.3, including only reads with a mapping quality of 60 or higher. Variants were filtered further using FreeBayes (v1.0.2) with a ploidy of 1 and options to exclude (i) alleles if a supporting base quality was less than 20 or the coverage was less than 10, (ii) alignments if the mapping quality was less than 60, and (iii) alleles if the fraction of reads supporting a SNP was below 90%. The binomial priors about observation expectations were turned off. The program snpEff v4.1l was then used to annotate SNPs, with annotation of downstream, upstream, intergenic, 5′ untranslated region (5′UTR), and 3′UTR changes turned off. A whole-genome alignment of all genomes was then built using snippy-core, with a minimum coverage depth of 10 required for a region to be considered part of the core.
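The thresholds described above can be illustrated with a minimal Python sketch. It is not how FreeBayes or snippy implement their filters; the `Read` class, the `call_site` function, and the choice to count coverage after the quality filters are assumptions made for illustration. Only the numeric thresholds come from the text.

```python
from dataclasses import dataclass

# Thresholds taken from the text; all names below are hypothetical.
MIN_BASE_QUALITY = 20
MIN_COVERAGE = 10
MIN_MAPPING_QUALITY = 60
MIN_ALT_FRACTION = 0.90


@dataclass
class Read:
    base: str             # base observed at the site
    base_quality: int     # Phred-scaled base quality
    mapping_quality: int  # mapping quality of the alignment


def call_site(reads, ref_base):
    """Return the alternate base if the site passes every threshold, else None."""
    # Drop observations that fail the mapping or base quality cutoffs.
    usable = [
        r for r in reads
        if r.mapping_quality >= MIN_MAPPING_QUALITY
        and r.base_quality >= MIN_BASE_QUALITY
    ]
    # Assumption: coverage is counted on the reads that survive the filters.
    if len(usable) < MIN_COVERAGE:
        return None
    counts = {}
    for r in usable:
        counts[r.base] = counts.get(r.base, 0) + 1
    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return None
    alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
    # Require at least 90% of usable reads to support the alternate allele.
    if alt_n / len(usable) >= MIN_ALT_FRACTION:
        return alt_base
    return None
```

Under these assumptions, a site covered by ten high-quality reads that all carry the same non-reference base is called, whereas a site with nine such reads, or one with only 80% support, is not.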

The genome sequences from the Hungarian mummies were resolved in a different manner. Sequence read archives from study PRJEB7454 (37) were mapped to H37Rv (41), imposing a minimum read length of 35 bp (base pairs) and a minimum mapping quality of 30. Pilon was run with the following parameters: --variant --mindepth 10 --minmq 30 --minqual 30, and the number of reads supporting each allele across all variant sites was manually inspected. Three genotypes were inferred from two high-coverage sequencing runs [ERR651000 (individual 68) and ERR651004 (individual 92)]. Genotypes were distinguished on the basis of allele frequencies. For individual 68, genotype 1 was deduced from alleles found at frequencies between 55 and 65% and genotype 2 from alleles found at frequencies between 35 and 45%. Variants at frequencies of ≥95% were considered fixed between the mixed infecting strains. Variants segregating at other frequencies were treated as ambiguous/missing data. For individual 92, we also found evidence indicative of a mixed infection; however, we were only confident in the most frequently called genotype, which comprised >90% of reads. All variants at frequencies of ≥95% were called. All other sequencing runs from the project were of low coverage and were excluded from the analyses.
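The frequency-based assignment used for individual 68 can be sketched as follows. This is an illustration of the stated frequency bands only, not the authors' procedure, which involved manual inspection. The function names and the inclusive treatment of the band boundaries are assumptions.

```python
def allele_frequency(supporting_reads, total_reads):
    """Fraction of reads at a site that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return supporting_reads / total_reads


def classify_allele(freq):
    """Assign a variant to a category using the frequency bands from the text.

    Assumption: band boundaries are treated as inclusive.
    """
    if freq >= 0.95:
        return "fixed"        # shared by both infecting strains
    if 0.55 <= freq <= 0.65:
        return "genotype_1"   # major strain
    if 0.35 <= freq <= 0.45:
        return "genotype_2"   # minor strain
    return "ambiguous"        # treated as missing data
```

For example, a variant supported by 60 of 100 reads would be assigned to genotype 1, and one supported by 75 of 100 reads would be treated as ambiguous.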

We then used an in-house Python script to exclude from the alignment any SNP that met one or more of the following criteria: (i) the SNP was located in a known repetitive region (e.g., PE/PPE genes; the script and annotation file are available at the GitHub repository), (ii) the proportion of ambiguous calls at the locus exceeded 1%, or (iii) the position was no longer polymorphic after pruning of outbreak isolates (applied only to the down-sampled dataset). The final alignment of SNPs consisted of 22,912 sites, of which 9,313 were parsimony informative.
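The three exclusion criteria can be sketched in Python as below. This is not the authors' in-house script. The data layout (a dictionary mapping genome position to a string of per-isolate base calls), the set of characters counted as ambiguous, and all function names are assumptions made for illustration.

```python
# Assumption: "N" and "-" mark ambiguous or missing calls in the alignment.
AMBIGUOUS = {"N", "-"}
MAX_AMBIGUOUS_FRACTION = 0.01  # 1% threshold from the text


def in_repeat(pos, repeat_regions):
    """True if pos falls inside any (start, end) interval, ends inclusive."""
    return any(start <= pos <= end for start, end in repeat_regions)


def keep_site(pos, column, repeat_regions):
    """Apply the three exclusion criteria to one alignment column."""
    if not column:
        return False
    # (i) exclude sites in known repetitive regions
    if in_repeat(pos, repeat_regions):
        return False
    calls = [base.upper() for base in column]
    # (ii) exclude sites where ambiguous calls exceed 1% of isolates
    n_ambiguous = sum(1 for base in calls if base in AMBIGUOUS)
    if n_ambiguous / len(calls) > MAX_AMBIGUOUS_FRACTION:
        return False
    # (iii) exclude sites that are no longer polymorphic
    alleles = {base for base in calls if base not in AMBIGUOUS}
    return len(alleles) > 1


def filter_alignment(columns, repeat_regions):
    """Return only the alignment columns that pass all three criteria."""
    return {
        pos: col for pos, col in columns.items()
        if keep_site(pos, col, repeat_regions)
    }
```

In this sketch, criterion (iii) is evaluated on whatever isolates remain in each column, so pruning of outbreak isolates would need to happen before `filter_alignment` is called.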
