After initial analysis, we sought local references to better understand the observed diversity. Because assembled Brazilian whole-genome sequences are vanishingly rare, we reached out to Drs. Brynildsrud and Eldholm, who in 2018 were first and corresponding authors, respectively, of an expansive investigation into Mycobacterium tuberculosis lineage 4 isolates. Their dataset includes hundreds of paired-end FASTQ read sets for South American tuberculosis genomes. Reads for 9 Brazilian isolates unrelated to our current study were downloaded from NCBI (BioProject PRJEB27366) and processed using the snippy pipeline [62]. Briefly, we included CLC-constructed contigs for all 66 of our isolates that passed QC, assembled whole genomes for global references, and raw paired-end FASTQ reads for the 9 isolates from the Brynildsrud et al. dataset (referred to in this manuscript as “local references”; Supplemental 7: Isolates & References). These local references represent sublineages 4.3 (LAM, n = 7), 4.4 (S-type, n = 1), and 4.7 (Congo, n = 1) [63]. The local reference reads were first assembled into contigs with ABySS [64] (abyss-pe, default parameters, k = 96); these contigs, along with all contigs from our isolates and the assembled genomes for global references, were batch-processed in snippy with the M. tuberculosis H37Rv GenBank (.gb) file as reference. The snippy output folders for each isolate and local reference were then processed by snippy-core, again using H37Rv.gb as reference. The resulting core alignment, which consists solely of variant sites and excludes complex changes such as indels, was analyzed for variants in 16 drug resistance-associated genes. To avoid uninformative, false-positive hits in repetitive regions of the genome, snippy-core was run again with the mask parameter and the M. tuberculosis H37Rv-based .bed mask file included in the snippy package, filtering out loci such as those encoding PE/PPE family proteins.
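The effect of a BED mask on a set of core SNP positions can be illustrated with a minimal Python sketch. This is not the snippy-core implementation: the function names (`load_bed`, `split_by_mask`), the interval coordinates, and the SNP positions are hypothetical and chosen only for illustration, and a single reference sequence is assumed. The sketch relies only on the standard BED convention that intervals are 0-based and half-open, whereas SNP positions are usually reported 1-based.

```python
from bisect import bisect_right


def load_bed(lines):
    """Parse BED lines into sorted, merged (start, end) intervals.

    Coordinates stay 0-based and half-open, as in the BED format.
    Assumes every interval lies on the same reference sequence.
    """
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        intervals.append((int(fields[1]), int(fields[2])))
    intervals.sort()
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlapping or adjacent interval: extend the previous one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def split_by_mask(positions, intervals):
    """Split 1-based SNP positions into (kept, removed) lists."""
    starts = [start for start, _ in intervals]
    kept, removed = [], []
    for pos in positions:
        zero_based = pos - 1
        # Index of the last interval starting at or before this site.
        i = bisect_right(starts, zero_based) - 1
        if i >= 0 and zero_based < intervals[i][1]:
            removed.append(pos)
        else:
            kept.append(pos)
    return kept, removed


# Hypothetical mask and SNP positions (not real H37Rv coordinates).
bed_lines = ["chrom\t100\t200", "chrom\t150\t300"]
snp_positions = [50, 100, 101, 300, 301]

mask = load_bed(bed_lines)
kept, removed = split_by_mask(snp_positions, mask)
fraction_removed = len(removed) / len(snp_positions)
print(kept, removed, fraction_removed)
```

Comparing the lengths of `kept` and `removed` gives the share of sites that a mask discards, which is the kind of figure reported when unmasked and masked core SNP sets are compared.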
The unmasked and masked core SNP sets were compared, and ~3% of SNPs in both isolates and local references were considered uninformative by this approach. Masked core SNPs were passed into MEGA-X v10.1.8 [65] and MrBayes v3.2.7a [66] separately. Maximum-likelihood (ML) trees were produced with 500 bootstrap replicates and the HKY model, with other parameters left at their defaults, and the tree with the highest log-likelihood is shown (Fig. 1A). In MrBayes, the GTR model was used, with the parameter nst = mixed set to sample across the GTR model space. After an MCMC run of 500,000 generations, the standard deviation of split frequencies fell well below the recommended 0.01, minimum ESS values were 300 or higher (above the recommended 100), and PSRF values were ~1.000 (Supplemental 5: Bayesian Analysis). A quick comparison of ML phylogenies (HKY, default parameters, bs = 100) built from the masked and unmasked core SNPs showed greater bootstrap values for nodes derived from the masked set, suggesting that removing uninformative hits yielded a more robust phylogeny (Supplemental 9: Masked vs Unmasked Phylogeny). The masked SNP set was therefore used for the final analysis. Trees were visualized in FigTree and MEGA-X. Final labeling of trees (Fig. 1) was performed with Inkscape v1.0 [67], but unmodified Newick tree files are also included for transparency (Supplemental 4 and 5).
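For readers unfamiliar with the PSRF diagnostic, the following Python sketch computes the basic Gelman-Rubin potential scale reduction factor for one parameter sampled in several independent chains. It is an illustration under stated assumptions, not the MrBayes code: the function name `psrf` and the sample values are hypothetical, and the exact estimator used by MrBayes may differ in detail. Values close to 1 indicate that the chains agree with one another.

```python
from math import sqrt
from statistics import mean, variance


def psrf(chains):
    """Basic Gelman-Rubin PSRF for one parameter.

    `chains` is a list of equal-length lists of post-burn-in samples,
    one list per independent run.
    """
    n = len(chains[0])
    if len(chains) < 2 or n < 2 or any(len(c) != n for c in chains):
        raise ValueError("need at least 2 chains of equal length >= 2")
    chain_means = [mean(c) for c in chains]
    # W: average of the within-chain sample variances.
    within = mean([variance(c) for c in chains])
    if within == 0:
        raise ValueError("within-chain variance is zero")
    # B: between-chain variance, scaled by the chain length.
    between = n * variance(chain_means)
    # Pooled estimate of the posterior variance.
    pooled = (n - 1) / n * within + between / n
    return sqrt(pooled / within)


# Hypothetical samples: two chains that agree, two that do not.
agreeing = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
disagreeing = [[1.0, 2.0, 3.0, 4.0], [11.0, 12.0, 13.0, 14.0]]

print(psrf(agreeing))
print(psrf(disagreeing))
```

With very short chains such as these, the value for agreeing chains can fall slightly below 1 because of the (n - 1)/n factor; for long runs it approaches 1 from above as the chains converge.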