To obtain gClusters, pairwise SNP distances between all sequences in the dataset must be calculated. These were computed with the R package ape using a customised script as described below:
First, a multifasta file must be constructed including all the samples to be evaluated. Multifasta can be constructed in different ways, but in the paper we used our in-house script. The input files - SNP and InDel files- are formatted according to our established Bioinformatics Analysis Pipeline. The Pipeline is described in detail in our lab repository (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated). The script to construct multifasta file is named “3_SNP_impact_prediction_ThePipeline.sh (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated/-/blob/main/tools/3_SNP_impact_prediction_ThePipeline.sh?ref_type=heads)
To calculate distances, the known resistance positions are filtered out of the multifasta file. This step is performed by the multifasta construction script, the output file is named “run_alignment_no_resis.fas”. This is the input file for the script that calculates distances, in the paper we used the script named “calculate_distances_New.R” (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated/-/blob/main/tools/calculate_distances_new.R?ref_type=heads). This is an accurate script but rather slow, thus a new and more efficient script was recently rewritten in Python, it was tested several times and the results of both, old and new, scripts are exactly the same (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated/-/blob/main/tools/python_versions/Distances.py?ref_type=heads). Pairwise distances among all the sequences in the multifasta will be computed; the script has an option to define the SNP threshold obtaining an additional and reduced distance file, which will accelerate the definition of clusters.
Once the distances file is ready, the last step is to define the clusters. The script concatenates all sequences within the set distance threshold. It can be done by using the complete distances file or the reduced one since the SNP threshold can be defined. It is recommended to use the reduced file, otherwise a list of clusters, with only one sequence, will be defined with those which fall outside the distance threshold. The script used in the paper is named “script_get_clusters.py” (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated/-/blob/main/tools/script_get_clusters.py?ref_type=heads). Recently, the code was optimised to speed-up the computational time for cluster definition (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated/-/blob/main/tools/python_versions/Clusters.py?ref_type=heads). Both scripts generate the same results.
Once the clusters are defined, it is recommended to evaluate the monophyly of all of them, specially in settings where transmission is high and continuous in time, making clusters harder to define. To evaluate monophyly, a phylogeny must be built using any chosen tool. We reconstruct the phylogeny using RAxML v8 (Stamatakis, 2014) as described in detail in the Appendix 1 of the paper. Monophyly of each cluster was inspected manually.
Timed phylogenies
Beast v2.5.1 (Bouckaert et al., 2014) was used to infer timed phylogenies
Multifasta files were the input file for beauti tool, the graphical-user interface to generate the XML files, which are the input of the beast program.
Several parameters must be defined to run beast, related to Site model, Clock model and model priors. They must be set-up according to the knowledge of the organism under study. A mock XML file (without sequences and trees) is included in the paper as an example (Supplementary_File2.xml). It contains all the parameters we used.
In the case of M. tuberculosis, and the specific dataset evaluated, we used: (parameters chosen in bold)
Site Model: GTR model
Gamma Category Count: 4
Frequencies: Estimated
Clock Model: Strict Clock
Clock.rate: clock rate was adjusted to the alignment length as follow; alignment_lenght=14910, clock rate= 4.6E-8 (Bos et al. 2014); TB full genome = 4411532; adjusted clock rate = 4.6e-8*4411532/14910 = 1.63E-5, the value indicated as parameter. Upper and lower values must be equally calculated; the values used as reference are those published in (Bos et al., 2014): 4.6 × 10 −8 [3 × 10 −8 - 6.2 × 10 −8 ]. A different ascertainment bias correction was also evaluated, as suggested by the beast community (https://groups.google.com/g/beast-users/c/QfBHMOqImFE); the results are detailed in the paper.
Estimate: thick.
Priors: Tree: Coalescent Constant Population
clock.rate: LogNormal M=3.3, S=0.2; Mean In Real Space: thick
In order to include calibration data, a group of well-dated sequences was included in the analysis. They belong to Ancient TB DNA samples from (Bos et al., 2014) and samples from a Spanish outbreak, recently published (López et al., 2023). Priors related to their origin were set as:
Distribution: Normal. Mean and Sigma values were set in accordance with the date values reported for each sample.
Finally, XML files were used as input for beast software. Three independent runs of Markov Chain Monte-Carlo (MCMC) length chains of 10 million samples were performed; 10,000 samples and 1000 trees were kept to summarise the results. Tracer v1.6 was used to determine adequate mixing, convergence of chains and sufficient sampling for every parameter (effective sample sizes, ESS >200), after a 10% burn-in.
Trees were annotated with the TreeAnnotator tool, included in the beast tool, using the Maximum Credibility Tree (MCC) option; they were visualised with FigTree v1.4.4.
More information about the models and use of beast tool can be obtained from https://taming-the-beast.org/,https://www.beast2.org/ and Drummond and Bouckaert (2015)
References
Bos KI, Harkins KM, Herbig A, Coscolla M, Weber N, Comas I, Forrest SA, Bryant JM, HarrisSR, Schuenemann VJ, Campbell TJ, Majander K, Wilbur AK, Guichon RA, Wolfe Steadman DL, Cook DC, Niemann S, Behr MA, Zumarraga M, Bastida R, Huson D, Nieselt K, Young D, Parkhill J, Buikstra JE, Gagneux S, Stone AC, Krause J. 2014. Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis. Nature 514:494–497.
Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu C-H, Xie D, Suchard MA, Rambaut A, Drummond AJ. 2014. BEAST 2: A Software Platform for Bayesian Evolutionary Analysis. PLoS Comput Biol 10:e1003537.
Drummond AJ, Bouckaert RR. 2015. Bayesian Evolutionary Analysis with BEAST. Cambridge University Press.
López MG, Campos-Herrero MI, Torres-Puente M, Cañas F, Comín J, Copado R, Wintringer P, Iqbal Z, Lagarejos E, Moreno-Molina M, Pérez-Lago L, Pino B, Sante L, de Viedma DG, Samper S, Comas I. 2023. Deciphering the Tangible Spatio-Temporal Spread of a 25-Year Tuberculosis Outbreak Boosted by Social Determinants. Microbiology Spectrum. doi:10.1128/spectrum.02826-22
Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.
Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:
López, M and Comas, I(2023). gClustering and phylogenetic analyses. Bio-protocol Preprint. bio-protocol.org/prep2462.
Cancino-Muñoz, I., López, M. G., Torres-Puente, M., Villamayor, L. M., Borrás, R., Borrás-Máñez, M., Bosque, M., Camarena, J. J., Colijn, C., Colomer-Roig, E., Colomina, J., Escribano, I., Esparcia-Rodríguez, O., García-García, F., Gil-Brusola, A., Gimeno, C., Gimeno-Gascón, A., Gomila-Sard, B., Gónzales-Granda, D., Gonzalo-Jiménez, N., Guna-Serrano, M. R., López-Hontangas, J. L., Martín-González, C., Moreno-Muñoz, R., Navarro, D., Navarro, M., Orta, N., Pérez, E., Prat, J., Rodríguez, J. C., Ruiz-García, M. M., Vanaclocha, H. and Comas, I.(2022). Population-based sequencing of Mycobacterium tuberculosis reveals how current population dynamics are shaped by past epidemics. eLife. DOI: 10.7554/eLife.76605
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this
article to respond.
0/150
Tips for asking effective questions
+ Description
Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.
Spinning
Post a Question
0 Q&A
Spinning
This protocol preprint was submitted via the "Request
a Protocol" track.