# Also in the Article

Analyses of admixture through f-statistics
This protocol is extracted from research article:
The comparative genomics and complex population history of Papio baboons
Sci Adv, Jan 30, 2019;

Procedure

Admixture graphs (52) model the ancestry of a set of samples in the form of a directed acyclic graph where edges capture drift along ancestral lineages, leaves represent the samples, and inner nodes represent either most recent common ancestral populations or admixture events where a new population is created as a mixture of two other populations. Hence, these graphs can capture more complex histories than simple tree phylogenies, but only simple forms of gene flow. Admixture graphs model all gene flow as admixture events and cannot easily model periods of continuous gene flow. Admixture graphs are parameterized by edge lengths (the amount of drift that occurred on a given ancestral lineage) and admixture proportions (how much of an ancestral admixed population was derived from one donor population rather than another). Properties of an admixture graph can be captured by so-called f-statistics (52), and these can be estimated from genomic data. This makes it possible to compare the statistics predicted by a graph, F, with statistics estimated from data, f. Graph parameters are estimated by minimizing the distance between F and f. We have implemented an R package (https://github.com/mailund/admixture_graph) for inferring graphs and graph parameters from vectors of observed statistics and applied this approach to the baboon data.

We used qpDstats from the ADMIXTOOLS package to compute estimates of f with corresponding Z values for all quartets (W, X, Y, Z), where W was rhesus macaque and X, Y, and Z were all combinations of baboon samples from three different species. The bulk of our analyses of f-statistics used sequence diversity and SNV data based on mapping reads to the Panu_2.0 genome assembly. Once the improved Panu_3.0 assembly was complete, which closed thousands of gaps in scaffolds but increased the total sequence length of the assembly by only 0.37%, we retested the likelihoods of the inferred phylogenetic relationships among lineages and the inferred history of admixture using SNVs called based on Panu_3.0. All conclusions regarding species phylogeny and admixture events that were based on Panu_2.0 data were confirmed when tested using SNV and diversity information based on read mapping against Panu_3.0. This is likely because the upgrade from Panu_2.0 to Panu_3.0 added little to the total sequence length.

Grouping of samples. When estimating f-statistics, we can either pool samples from the same population together or compute at the level of individual samples. Pooling samples would potentially give a better estimate of population allele frequencies, but can mask within-population differences that might be informative about recent gene flow. We therefore chose to estimate the statistics at the individual sample level. For this analysis, we estimated the f4-statistics for all triplets of baboon samples combined with rhesus macaque. These were computed using the qpDstats tool from the ADMIXTOOLS package (https://github.com/DReichLab/AdmixTools). Inferring admixture graphs with each sample as a leaf is computationally intractable. The number of possible graphs grows superexponentially with the number of leaves, and our brute-force approach to exploring the graph space only scales to a small number of leaves. We therefore chose to keep the graphs at the species level. This way, samples from the same species are expected to have the same relationship to all other species. Parameters are estimated from the combined set of individual samples within each species.

We inspected the estimated f-statistics to ensure that the species grouping matched with similar vectors of statistics. In general, samples from the same species had very similar f-statistics compared to all other samples. The single exception was P. anubis sample 30877 from the Aberdare region of Kenya that has a different profile from the other olive baboons. This sample shows evidence of a recent admixture with P. cynocephalus. To focus on the ancient admixture events only, we removed this sample from the admixture graph analysis; thus, the species “anubis” in the following refers only to the remaining P. anubis samples.

Fitting admixture graphs. A given graph topology specifies a polynomial of edge lengths and admixture proportions as the expected value for each f-statistic. These polynomials are linear with respect to the edge lengths only, and so we stored them as rows in a matrix of polynomials of admixture proportions only. To measure the fit between a graph and the observed statistics, we defined a cost function: a weighted sum of squared errors between graph predictions F and statistics f. The weights are reciprocals of the SDs of the statistics f, given by ADMIXTOOLS (as the Z values divided by f). We fitted the parameters of the graph using a mix of analytic and numerical optimization. After fixing the admixture proportions, the polynomial equations of predictions F are linear and thus solvable analytically. To optimize the admixture proportions, we used the Nelder-Mead package (https://cran.r-project.org/web/packages/neldermead/index.html).

Exploring the space of admixture graph topologies. We explored the space of graph topologies with a brute-force approach, bounding the number of allowed admixture events. Because of the large number of possible graphs, we could not exhaustively explore all the topologies including all the species (six baboon species and the rhesus macaque outgroup). Instead, we used various heuristics.

We could explore all the trees with seven leaves. For admixture graphs with a single admixture event, we could only explore all the graphs with six leaves; hence, we built all the graphs with one baboon species missing and then reinserted the missing species into the graph (such that inserting it cannot add a second admixture event). For additional admixture events, we took a greedy approach and explored all the graphs reachable by adding a new admixture event to the best graphs with one less event (although we have no guarantee that the optimal graph with n events is necessarily an extension in this way of the optimal graph with n - 1 events).

Estimating graph parameters and testing models. We developed a Metropolis-Hastings MCMC (Markov Chain Monte Carlo) to sample the posterior of graph parameters given observed f-statistics. By sampling the posterior of graph parameters, we obtained estimates of admixture proportions and the uncertainty in these estimates. In addition, we can use the posterior samples in an importance sampler to obtain the likelihood of the observed statistics given a graph topology by integrating over the parameters of the graph.

Sampling graph parameters. Let D denote the observed data (the estimated f-statistics), T a given graph topology, and θ the graph parameters of graph T (edge lengths and admixture proportions). The purpose of the MCMC is to sample over the posterior distribution of graph parameters given the observed data and the graph topology, i.e., sample from p(θ|D, T). Since we can compute the likelihood of a parameter point, p(D|θ, T), up to a normalization factor, we can use the Metropolis-Hastings algorithm to construct an MCMC that samples over the posterior distribution. We transformed the parameter space to make the proposal distribution symmetric (thus, the acceptance probability for the Metropolis-Hastings step is simply the posterior ratio) and to ensure that the parameters are all legal for their interpretation in the graph framework. Edge lengths, which must be positive, were log-transformed, and admixture proportions, which must fall within the unit interval, were transformed with the inverse normal cumulative distribution function. After transformation, we used an adaptive algorithm to construct the proposal distribution based on correlations in the previously sampled variables. For each graph, we ran three independent chains with 10,000 steps and we checked convergence by comparing the distributions from the independent chains. Since edge lengths are measured in terms of drift, they do not have a simple interpretation as time parameters. We therefore considered them as nuisance parameters in the MCMC analysis. We used them to test convergence of the Markov chains, but in the results presented here, we focused on estimates of admixture proportions.

Calculating graph topology likelihoods. To compare two graph topologies, we can compute the topology likelihood p(D|T) for a given topology T. Using this likelihood estimate, we can compare two topologies using the Bayes factor $KT1,T2=p(D|T1)p(D|T2)$, which captures the relative support the data provides for one topology over another: If, e.g., $KT1,T2=10$, we would consider T1 the more likely topology unless a priori topology T2 was at least 10 times more likely than T1.

Given data D and topology T, the likelihood of T is computed by integrating over all the graph parameters for the topology p(D|T) = ∫p(D, θ|T) dθ = ∫p(D|θ, T) p(θ|T) dθ. This integral can be approximated by sampling from the prior distribution of graph parameters and computing the mean likelihood, $∫p(D|θ,T)p(θ|T)dθ≈1N∑i=1Np(D|θi,T)$, where θi ~ p(θ|T), but this estimator has a large variance since most parameters drawn from the prior distribution have a very low likelihood. Instead, we used the samples from the posterior distribution to estimate the likelihood using an importance sampler.

Now, because $∫p(θ|D,T)p(D|θ,T)dθ=∫p(D,θ|T)/p(D|T)p(D,θ|T)/p(θ|T)dθ=∫p(θ|T)p(D|T)dθ=1p(D|T)∫p(θ|T)dθ=1p(D|T)$, we can estimate $[p(D|T)]−1=∫1p(D|θ,T)p(θ|D,T)dθ≈1N∑i=1N[p(D|θi,T)]−1$, where θi ~ p(θ|D, T).

Note: The content above has been extracted from a research article, so it may not display correctly.

Q&A