gClustering and phylogenetic analyses

Mariana G López; Iñaki Comas

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Preprint

gClustering and phylogenetic analyses

ML Mariana G López

IC Iñaki Comas

Last updated date: Oct 9, 2023 Views: 293 Forks: 0

An abbreviated version of this protocol was published in eLife in Jul, 2022

Population-based sequencing of Mycobacterium tuberculosis reveals how current population dynamics are shaped by past epidemics

Download PDF

Ask a question

How to cite

Favorite

To obtain gClusters, pairwise SNP distances between all sequences in the dataset must be calculated. These were computed with the R package ape using a customised script as described below:

First, a multifasta file must be constructed including all the samples to be evaluated. Multifasta can be constructed in different ways, but in the paper we used our in-house script. The input files - SNP and InDel files- are formatted according to our established Bioinformatics Analysis Pipeline. The Pipeline is described in detail in our lab repository (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated). The script to construct multifasta file is named “3_SNP_impact_prediction_ThePipeline.sh (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated/-/blob/main/tools/3_SNP_impact_prediction_ThePipeline.sh?ref_type=heads)
To calculate distances, the known resistance positions are filtered out of the multifasta file. This step is performed by the multifasta construction script, the output file is named “run_alignment_no_resis.fas”. This is the input file for the script that calculates distances, in the paper we used the script named “calculate_distances_New.R” (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated/-/blob/main/tools/calculate_distances_new.R?ref_type=heads). This is an accurate script but rather slow, thus a new and more efficient script was recently rewritten in Python, it was tested several times and the results of both, old and new, scripts are exactly the same (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated/-/blob/main/tools/python_versions/Distances.py?ref_type=heads). Pairwise distances among all the sequences in the multifasta will be computed; the script has an option to define the SNP threshold obtaining an additional and reduced distance file, which will accelerate the definition of clusters.
Once the distances file is ready, the last step is to define the clusters. The script concatenates all sequences within the set distance threshold. It can be done by using the complete distances file or the reduced one since the SNP threshold can be defined. It is recommended to use the reduced file, otherwise a list of clusters, with only one sequence, will be defined with those which fall outside the distance threshold. The script used in the paper is named “script_get_clusters.py” (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated/-/blob/main/tools/script_get_clusters.py?ref_type=heads). Recently, the code was optimised to speed-up the computational time for cluster definition (https://gitlab.com/tbgenomicsunit/thepipeline_v1.updated/-/blob/main/tools/python_versions/Clusters.py?ref_type=heads). Both scripts generate the same results.
Once the clusters are defined, it is recommended to evaluate the monophyly of all of them, specially in settings where transmission is high and continuous in time, making clusters harder to define. To evaluate monophyly, a phylogeny must be built using any chosen tool. We reconstruct the phylogeny using RAxML v8 (Stamatakis, 2014) as described in detail in the Appendix 1 of the paper. Monophyly of each cluster was inspected manually.

Timed phylogenies

Beast v2.5.1 (Bouckaert et al., 2014) was used to infer timed phylogenies

Multifasta files were the input file for beauti tool, the graphical-user interface to generate the XML files, which are the input of the beast program.
Several parameters must be defined to run beast, related to Site model, Clock model and model priors. They must be set-up according to the knowledge of the organism under study. A mock XML file (without sequences and trees) is included in the paper as an example (Supplementary_File2.xml). It contains all the parameters we used.
In the case of M. tuberculosis, and the specific dataset evaluated, we used: (parameters chosen in bold)
- Site Model: GTR model
  - Gamma Category Count: 4
  - Frequencies: Estimated
- Clock Model: Strict Clock
  - Clock.rate: clock rate was adjusted to the alignment length as follow; alignment_lenght=14910, clock rate= 4.6E-8 (Bos et al. 2014); TB full genome = 4411532; adjusted clock rate = 4.6e-8*4411532/14910 = 1.63E-5, the value indicated as parameter. Upper and lower values must be equally calculated; the values used as reference are those published in (Bos et al., 2014): 4.6 × 10 −8 [3 × 10 −8 - 6.2 × 10 −8 ]. A different ascertainment bias correction was also evaluated, as suggested by the beast community (https://groups.google.com/g/beast-users/c/QfBHMOqImFE); the results are detailed in the paper.
  - Estimate: thick.
- Priors: Tree: Coalescent Constant Population
  - clock.rate: LogNormal M=3.3, S=0.2; Mean In Real Space: thick
  - freqParameter: Uniform, Lower=0, Upper=1, Offset=0 (default values)
  - gammaShape: Exponential, Mean=1, Offset=0 (default values)
  - rate(all): Gamma, Alpha=0.05, Beta=10, Mode=ShapeScale, Offset=0 (default values)
In order to include calibration data, a group of well-dated sequences was included in the analysis. They belong to Ancient TB DNA samples from (Bos et al., 2014) and samples from a Spanish outbreak, recently published (López et al., 2023). Priors related to their origin were set as:
- Distribution: Normal. Mean and Sigma values were set in accordance with the date values reported for each sample.
Finally, XML files were used as input for beast software. Three independent runs of Markov Chain Monte-Carlo (MCMC) length chains of 10 million samples were performed; 10,000 samples and 1000 trees were kept to summarise the results. Tracer v1.6 was used to determine adequate mixing, convergence of chains and sufficient sampling for every parameter (effective sample sizes, ESS >200), after a 10% burn-in.
Trees were annotated with the TreeAnnotator tool, included in the beast tool, using the Maximum Credibility Tree (MCC) option; they were visualised with FigTree v1.4.4.

More information about the models and use of beast tool can be obtained from https://taming-the-beast.org/, https://www.beast2.org/ and Drummond and Bouckaert (2015)

References

Bos KI, Harkins KM, Herbig A, Coscolla M, Weber N, Comas I, Forrest SA, Bryant JM, HarrisSR, Schuenemann VJ, Campbell TJ, Majander K, Wilbur AK, Guichon RA, Wolfe Steadman DL, Cook DC, Niemann S, Behr MA, Zumarraga M, Bastida R, Huson D, Nieselt K, Young D, Parkhill J, Buikstra JE, Gagneux S, Stone AC, Krause J. 2014. Pre-Columbian mycobacterial genomes reveal seals as a source of New World human tuberculosis. Nature 514:494–497.
Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu C-H, Xie D, Suchard MA, Rambaut A, Drummond AJ. 2014. BEAST 2: A Software Platform for Bayesian Evolutionary Analysis. PLoS Comput Biol 10:e1003537.
Drummond AJ, Bouckaert RR. 2015. Bayesian Evolutionary Analysis with BEAST. Cambridge University Press.
López MG, Campos-Herrero MI, Torres-Puente M, Cañas F, Comín J, Copado R, Wintringer P, Iqbal Z, Lagarejos E, Moreno-Molina M, Pérez-Lago L, Pino B, Sante L, de Viedma DG, Samper S, Comas I. 2023. Deciphering the Tangible Spatio-Temporal Spread of a 25-Year Tuberculosis Outbreak Boosted by Social Determinants. Microbiology Spectrum. doi:10.1128/spectrum.02826-22
Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313.

How to cite：

Readers should cite both the Bio-protocol preprint and the original research article where this protocol was used:

López, M and Comas, I(2023). gClustering and phylogenetic analyses. Bio-protocol Preprint. bio-protocol.org/prep2462.
Cancino-Muñoz, I., López, M. G., Torres-Puente, M., Villamayor, L. M., Borrás, R., Borrás-Máñez, M., Bosque, M., Camarena, J. J., Colijn, C., Colomer-Roig, E., Colomina, J., Escribano, I., Esparcia-Rodríguez, O., García-García, F., Gil-Brusola, A., Gimeno, C., Gimeno-Gascón, A., Gomila-Sard, B., Gónzales-Granda, D., Gonzalo-Jiménez, N., Guna-Serrano, M. R., López-Hontangas, J. L., Martín-González, C., Moreno-Muñoz, R., Navarro, D., Navarro, M., Orta, N., Pérez, E., Prat, J., Rodríguez, J. C., Ruiz-García, M. M., Vanaclocha, H. and Comas, I.(2022). Population-based sequencing of Mycobacterium tuberculosis reveals how current population dynamics are shaped by past epidemics. eLife. DOI: 10.7554/eLife.76605

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

0 Q&A

This protocol preprint was submitted via the "Request a Protocol" track.

Share your protocol with your peers.

Submit a Preprint Protocol