To estimate substitution rates by means of sampling date calibrated Bayesian evolutionary analyses (52), we down-sampled the collection to a manageable size by including a maximum of 20 randomly chosen isolates from each country. This resulted in a sample collection of 269 genomes. Three ancient M.tb genomes from 18th century Hungarian mummies (37) and five Danish isolates from the 1960s (38) were included to provide temporal structure to the data.

A total of 7994 variable sites remained after selecting the 269 genomes. A preliminary check using TempEst (53) confirmed a moderate but highly significant temporal signal in the data (fig. S4), which was also confirmed by tip randomization (see below). A GTR substitution model was chosen on the basis of model testing, as described above. On the basis of marginal-likelihood estimation (MLE), an exponential demographic model was found to best fit the data [tested against the constant population size and Gaussian Markov Random Fields (GMRF) skyride models (54)]. Also based on MLE, an uncorrelated relaxed clock was favored over a strict clock model. Three independent BEAST Markov Chain Monte Carlo (MCMC) chains were run, and convergence to a stationary posterior distribution was confirmed both within and between chains. These analyses resulted in an estimated substitution rate of 4.84 × 10−8 (95% HPD, 4.16 × 10−8 to 5.44 × 10−8) substitutions per site per year and an estimated time of the MRCA in 1096 CE (95% HPD, 955 to 1231). To assess the robustness of the temporal inference, we performed 10 additional runs after randomization of the sampling dates (55). None of the randomized runs had rate estimates overlapping with the inference using real sampling dates (fig. S3), supporting the robustness of the original inference.

We then ran a larger dataset (1207 genomes, after retaining a maximum of five representatives from each of three densely sampled outbreaks from Argentina and 10 from Quebec, Canada) in BEAST v1.8.4 (56) with a fixed substitution rate as estimated above. The MRCA of this tree was inferred to have existed in the year 1157 (HPD intervals not reported due to our application of a fixed rate in this analysis). The overall accuracy of the dating inferences in BEAST was further supported by independent analyses applying Least Squares Dating (LSD) v0.3 (57), resulting in an estimated time to MRCA in 1195 CE (95% HPD, 1061 to 1270) for the 1207 genome dataset. The tree generated for 1207 genomes in BEAST v1.8.4 was used as an input tree for phylogeographic analyses using both the BASTA module in BEAST2 and simple DTA as implemented in BEAST v1.8.4 (9).

To reduce computational complexity and increase post-analysis interpretability, we collapsed the country of origin information into five distinct UN regions: North America (United States and Canada), South America (Brazil, Argentina, and Peru), Africa (Democratic Republic of the Congo, Malawi, South Africa, and Uganda), Europe (Denmark, the Netherlands, Hungary, Portugal, Russia, and United Kingdom), and Southeast Asia (Vietnam).

In the discrete trait model (7), we used a GTR model, restricted the clock rate to 4.84 × 10−8, and set the starting tree as specified above. The prior population size was set as constant. A symmetric deme substitution model was used, and we turned Bayesian stochastic search variable selection on. Ancestral states were reconstructed at all nodes. Other priors were left at default values. The chain was run for 10,000,000 generations with logging every 10,000th iteration, and three of these runs were combined to create the final posterior sample of trees and parameters. We verified chain convergence and good mixing and an effective sample size (ESS) > 200 for all parameters using Tracer (http://tree.bio.ed.ac.uk/software/tracer/). A maximum clade credibility (MCC) tree was created using TreeAnnotator (http://beast.community/treeannotator), with 20% of the chain discarded as burn in. The resulting tree displayed a European root with 92% posterior probability.

A separate phylogeographic reconstruction of the L4.3 RdRio family was also performed. This was completed by manually extracting the RdRio subtree from the full time tree and then setting up a new DTA run consisting of these 243 isolates alone. For this analysis, we acquired patient country of origin data for genomes from our Portuguese and Dutch collections, which led to a changed country of origin for 13 genomes in this dataset. To restrict the number of possible demes of low/single sample size, we collapsed countries into eight different geographic categories: Iberia (Portugal, Spain, and Cape Verde), North Europe (United Kingdom, Russia, Bosnia, Netherlands, and Germany), Peru, Atlantic South America (Venezuela, Brazil, Aruba, and Suriname), Congo/Angola, Malawi/Mozambique, Uganda, and South Africa. The prior population size was set to 10,000. Other than this, the analysis was run with the same parameters as for the full dataset DTA. The rationale for placing Cape Verde in the Iberia category is that Cape Verde was uninhabited before Portuguese settlement in the 15th century.

As a complement to DTA, we used the BEAST2 module BASTA v2.3.1 (8) for phylogeographic inferences. We specified a migration model with the same five demes as above. The initial values for deme transition rate were set to 1.0 × 10−3 and the subpopulation to 6000, and these numbers correspond to the median outputs from DTA. The rate matrix and population size priors were given log-normal prior distributions with M = −10 and S = 2.0 and M = 9.0 and S = 0.6, respectively. Since it was not possible to place deme restrictions on internal nodes in BASTA, we artificially introduced an isolate with a branch length of 1.0 × 10−10 from the root, and the location of this isolate was set to correspond to each of the five demes in different runs. The results of these five runs were subsequently evaluated jointly. We ran each chain for 1,000,000 generations with storing set to every 1000th iteration. We disabled all scaling operators except the rate scaler, which was given a scale factor of 0.8 and a weight of 1.0, and the population size scaler, which was given a weight of 3 and a scale factor of 0.8, and degrees of freedom was set to 1 (subpopulation sizes effectively set to equal). Since we knew that the MRCA of all isolates existed roughly around the year 1100 CE, before European colonization of the Americas, we discarded all trees with a root inferred to be from North or South America. The parameter logs were inspected in the same way as described for DTA, and an MCC tree was made using TreeAnnotator v.2.4.5, with burn in set to 20% and node heights set to median. All BEAST and BEAST2 runs were run locally or on the Cyberinfrastructure for phylogenetic research (CIPRES) science gateway (58).

Migration matrices (10) were constructed to visualize the overall patterns of migration inferred by the two methods. Both methods inferred Europe to have played a pivotal role in sourcing the rest of the world with TB (fig. S5).

To study migration over time, conceptually mirroring the methodology used in (10), we wrote an in-house script (available at https://github.com/admiralenola/globall4scripts) to read the MCC tree from the BASTA runs using the ETE toolkit (59) and traversed the time tree in a sliding window fashion, for each year writing out the number of branches corresponding to the five different demes. In our analyses, migration events were set to occur on nodes but, in reality, could have occurred at any point along the branch downstream of this node. This introduces a slight bias toward inflated ages of migration events, which is most pronounced for very early migration events but negligible for later migrations due to extensive branching.

Note: The content above has been extracted from a research article, so it may not display correctly.



Q&A
Please log in to submit your questions online.
Your question will be posted on the Bio-101 website. We will send your questions to the authors of this protocol and Bio-protocol community members who are experienced with this method. you will be informed using the email address associated with your Bio-protocol account.



We use cookies on this site to enhance your user experience. By using our website, you are agreeing to allow the storage of cookies on your computer.