Analyses of admixture through f-statistics

JR Jeffrey Rogers MR Muthuswamy Raveendran RH R. Alan Harris TM Thomas Mailund KL Kalle Leppälä GA Georgios Athanasiadis MS Mikkel Heide Schierup JC Jade Cheng KM Kasper Munch JW Jerilyn A. Walker MK Miriam K. Konkel VJ Vallmer Jordan CS Cody J. Steely TB Thomas O. Beckstrom CB Christina Bergey AB Andrew Burrell DS Dominik Schrempf AN Angela Noll MK Maximillian Kothe GK Gisela H. Kopp YL Yue Liu SM Shwetha Murali KB Konstantinos Billis FM Fergal J. Martin MM Matthieu Muffato LC Laura Cox JE James Else TD Todd Disotell DM Donna M. Muzny JP Jane Phillips-Conroy BA Bronwen Aken EE Evan E. Eichler TM Tomas Marques-Bonet CK Carolin Kosiol MB Mark A. Batzer MH Matthew W. Hahn JT Jenny Tung DZ Dietmar Zinner CR Christian Roos CJ Clifford J. Jolly RG Richard A. Gibbs KW Kim C. Worley

This protocol is extracted from research article:

The comparative genomics and complex population history of Papio baboons

**
Sci Adv**,
Jan 30, 2019;
DOI:
10.1126/sciadv.aau6947

The comparative genomics and complex population history of Papio baboons

Procedure

Admixture graphs (*52*) model the ancestry of a set of samples in the form of a directed acyclic graph where edges capture drift along ancestral lineages, leaves represent the samples, and inner nodes represent either most recent common ancestral populations or admixture events where a new population is created as a mixture of two other populations. Hence, these graphs can capture more complex histories than simple tree phylogenies, but only simple forms of gene flow. Admixture graphs model all gene flow as admixture events and cannot easily model periods of continuous gene flow. Admixture graphs are parameterized by edge lengths (the amount of drift that occurred on a given ancestral lineage) and admixture proportions (how much of an ancestral admixed population was derived from one donor population rather than another). Properties of an admixture graph can be captured by so-called *f*-statistics (*52*), and these can be estimated from genomic data. This makes it possible to compare the statistics predicted by a graph, *F*, with statistics estimated from data, *f*. Graph parameters are estimated by minimizing the distance between *F* and *f*. We have implemented an R package (https://github.com/mailund/admixture_graph) for inferring graphs and graph parameters from vectors of observed statistics and applied this approach to the baboon data.

We used qpDstats from the ADMIXTOOLS package to compute estimates of *f* with corresponding *Z* values for all quartets (*W, X, Y, Z*), where *W* was rhesus macaque and *X, Y,* and *Z* were all combinations of baboon samples from three different species. The bulk of our analyses of *f*-statistics used sequence diversity and SNV data based on mapping reads to the Panu_2.0 genome assembly. Once the improved Panu_3.0 assembly was complete, which closed thousands of gaps in scaffolds but increased the total sequence length of the assembly by only 0.37%, we retested the likelihoods of the inferred phylogenetic relationships among lineages and the inferred history of admixture using SNVs called based on Panu_3.0. All conclusions regarding species phylogeny and admixture events that were based on Panu_2.0 data were confirmed when tested using SNV and diversity information based on read mapping against Panu_3.0. This is likely because the upgrade from Panu_2.0 to Panu_3.0 added little to the total sequence length.

*Grouping of samples*. When estimating *f*-statistics, we can either pool samples from the same population together or compute at the level of individual samples. Pooling samples would potentially give a better estimate of population allele frequencies, but can mask within-population differences that might be informative about recent gene flow. We therefore chose to estimate the statistics at the individual sample level. For this analysis, we estimated the *f*_{4}-statistics for all triplets of baboon samples combined with rhesus macaque. These were computed using the qpDstats tool from the ADMIXTOOLS package (https://github.com/DReichLab/AdmixTools). Inferring admixture graphs with each sample as a leaf is computationally intractable. The number of possible graphs grows superexponentially with the number of leaves, and our brute-force approach to exploring the graph space only scales to a small number of leaves. We therefore chose to keep the graphs at the species level. This way, samples from the same species are expected to have the same relationship to all other species. Parameters are estimated from the combined set of individual samples within each species.

We inspected the estimated *f*-statistics to ensure that the species grouping matched with similar vectors of statistics. In general, samples from the same species had very similar *f*-statistics compared to all other samples. The single exception was *P. anubis* sample 30877 from the Aberdare region of Kenya that has a different profile from the other olive baboons. This sample shows evidence of a recent admixture with *P. cynocephalus*. To focus on the ancient admixture events only, we removed this sample from the admixture graph analysis; thus, the species “anubis” in the following refers only to the remaining *P. anubis* samples.

*Fitting admixture graphs*. A given graph topology specifies a polynomial of edge lengths and admixture proportions as the expected value for each *f*-statistic. These polynomials are linear with respect to the edge lengths only, and so we stored them as rows in a matrix of polynomials of admixture proportions only. To measure the fit between a graph and the observed statistics, we defined a cost function: a weighted sum of squared errors between graph predictions *F* and statistics *f*. The weights are reciprocals of the SDs of the statistics *f*, given by ADMIXTOOLS (as the *Z* values divided by *f*). We fitted the parameters of the graph using a mix of analytic and numerical optimization. After fixing the admixture proportions, the polynomial equations of predictions *F* are linear and thus solvable analytically. To optimize the admixture proportions, we used the Nelder-Mead package (https://cran.r-project.org/web/packages/neldermead/index.html).

*Exploring the space of admixture graph topologies*. We explored the space of graph topologies with a brute-force approach, bounding the number of allowed admixture events. Because of the large number of possible graphs, we could not exhaustively explore all the topologies including all the species (six baboon species and the rhesus macaque outgroup). Instead, we used various heuristics.

We could explore all the trees with seven leaves. For admixture graphs with a single admixture event, we could only explore all the graphs with six leaves; hence, we built all the graphs with one baboon species missing and then reinserted the missing species into the graph (such that inserting it cannot add a second admixture event). For additional admixture events, we took a greedy approach and explored all the graphs reachable by adding a new admixture event to the best graphs with one less event (although we have no guarantee that the optimal graph with *n* events is necessarily an extension in this way of the optimal graph with *n* - 1 events).

*Estimating graph parameters and testing models*. We developed a Metropolis-Hastings MCMC (Markov Chain Monte Carlo) to sample the posterior of graph parameters given observed *f*-statistics. By sampling the posterior of graph parameters, we obtained estimates of admixture proportions and the uncertainty in these estimates. In addition, we can use the posterior samples in an importance sampler to obtain the likelihood of the observed statistics given a graph topology by integrating over the parameters of the graph.

*Sampling graph parameters*. Let *D* denote the observed data (the estimated *f*-statistics), *T* a given graph topology, and θ the graph parameters of graph *T* (edge lengths and admixture proportions). The purpose of the MCMC is to sample over the posterior distribution of graph parameters given the observed data and the graph topology, i.e., sample from *p*(θ|*D*, *T*). Since we can compute the likelihood of a parameter point, *p*(*D*|θ, *T*), up to a normalization factor, we can use the Metropolis-Hastings algorithm to construct an MCMC that samples over the posterior distribution. We transformed the parameter space to make the proposal distribution symmetric (thus, the acceptance probability for the Metropolis-Hastings step is simply the posterior ratio) and to ensure that the parameters are all legal for their interpretation in the graph framework. Edge lengths, which must be positive, were log-transformed, and admixture proportions, which must fall within the unit interval, were transformed with the inverse normal cumulative distribution function. After transformation, we used an adaptive algorithm to construct the proposal distribution based on correlations in the previously sampled variables. For each graph, we ran three independent chains with 10,000 steps and we checked convergence by comparing the distributions from the independent chains. Since edge lengths are measured in terms of drift, they do not have a simple interpretation as time parameters. We therefore considered them as nuisance parameters in the MCMC analysis. We used them to test convergence of the Markov chains, but in the results presented here, we focused on estimates of admixture proportions.

*Calculating graph topology likelihoods*. To compare two graph topologies, we can compute the topology likelihood *p*(*D*|*T*) for a given topology *T*. Using this likelihood estimate, we can compare two topologies using the Bayes factor ${\mathit{K}}_{{\mathit{T}}_{1},{\mathit{T}}_{2}}=\frac{\mathit{p}(\mathit{D}|{\mathit{T}}_{1})}{\mathit{p}(\mathit{D}|{\mathit{T}}_{2})}$, which captures the relative support the data provides for one topology over another: If, e.g., ${\mathit{K}}_{{\mathit{T}}_{1},{\mathit{T}}_{2}}=10$, we would consider *T*_{1} the more likely topology unless a priori topology *T*_{2} was at least 10 times more likely than *T*_{1}.

Given data *D* and topology *T*, the likelihood of *T* is computed by integrating over all the graph parameters for the topology *p*(*D*|*T*) = ∫*p*(*D*, θ|*T*) *d*θ = ∫*p*(*D*|θ, *T*) *p*(θ|*T*) *d*θ. This integral can be approximated by sampling from the prior distribution of graph parameters and computing the mean likelihood, $\int \mathit{p}(\mathit{D}|\mathrm{\theta},\mathit{T})\mathit{p}(\mathrm{\theta}|\mathit{T})\mathit{d}\mathrm{\theta}\approx \frac{1}{\mathit{N}}{\displaystyle \sum _{\mathit{i}=1}^{\mathit{N}}}\mathit{p}(\mathit{D}|{\mathrm{\theta}}_{\mathit{i}},\mathit{T})$, where θ_{i} ~ *p*(θ|*T*), but this estimator has a large variance since most parameters drawn from the prior distribution have a very low likelihood. Instead, we used the samples from the posterior distribution to estimate the likelihood using an importance sampler.

Now, because $\int \frac{\mathit{p}(\mathrm{\theta}|\mathit{D},\mathit{T})}{\mathit{p}(\mathit{D}|\mathrm{\theta},\mathit{T})}\mathit{d}\mathrm{\theta}=\int \frac{\mathit{p}(\mathit{D},\mathrm{\theta}|\mathit{T})/\mathit{p}(\mathit{D}|\mathit{T})}{\mathit{p}(\mathit{D},\mathrm{\theta}|\mathit{T})/\mathit{p}(\mathrm{\theta}|\mathit{T})}\mathit{d}\mathrm{\theta}=\int \frac{\mathit{p}(\mathrm{\theta}|\mathit{T})}{\mathit{p}(\mathit{D}|\mathit{T})}\mathit{d}\mathrm{\theta}=\frac{1}{\mathit{p}(\mathit{D}|\mathit{T})}\int \mathit{p}(\mathrm{\theta}|\mathit{T})\mathit{d}\mathrm{\theta}=\frac{1}{\mathit{p}(\mathit{D}|\mathit{T})}$, we can estimate ${[\mathit{p}(\mathit{D}|\mathit{T})]}^{-1}=\int \frac{1}{\mathit{p}(\mathit{D}|\mathrm{\theta},\mathit{T})}\mathit{p}(\mathrm{\theta}|\mathit{D},\mathit{T})\mathit{d}\mathrm{\theta}\approx \frac{1}{\mathit{N}}{\displaystyle \sum _{\mathit{i}=1}^{\mathit{N}}}{[\mathit{p}(\mathit{D}|{\mathrm{\theta}}_{\mathit{i}},\mathit{T})]}^{-1}$, where θ_{i} ~ *p*(θ|*D*, *T*).

Note: The content above has been extracted from a research article, so it may not display correctly.

Q&A

Your question will be posted on the Bio-101 website. We will send your questions to the authors of this protocol and Bio-protocol community members who are experienced with this method. you will be informed using the email address associated with your Bio-protocol account.