Computational Advances Relating to Detecting Archaic Admixture

K D Ahlquist; Mayra M Bañuelos; Alyssa Funk; Jiaying Lai; Stephen Rong; Fernando A Villanea; Kelsey E Witt


This method is extracted from research article: Genome Biol Evol, May 2021

Our Tangled Family Tree: New Genomic Methods Offer Insight into the Legacy of Archaic Admixture

DOI: 10.1093/gbe/evab115

These include 1) methods related to the ancestral recombination graph (ARG) and genealogical inference, and 2) methods that make use of machine learning and approximate Bayesian computation (ABC).

Methods related to the ARG have expanded significantly in the last few years. Theoretically, inferring the ARG means inferring the complete genealogical history, including recombination and coalescence, for every piece of the genome for all sampled individuals (fig. 2c). Full knowledge of the ARG would provide all the information available in a set of genomes about the history of those lineages, including events such as admixture, demographic changes, and recombination. An ARG may also reveal information about selection and adaptation over time. In practice, tracking the history of each recombinant fragment, and storing such a large amount of information, is a herculean task. As such, all of the ARG-based methods discussed below rely on simplifications or heuristics to approximate the full theoretical ARG. ARGWeaver (Rasmussen et al. 2014) allows for ARG inference with a sample size of tens of mammalian genomes. ARGWeaver-D (Hubisz et al. 2020) builds on the ARGWeaver model: it traces the origin of genomic segments through the inferred ARG under a user-specified demographic model, and it allows the user to include heterochronous samples. The ability to trace the origin of genomic segments allows ARGWeaver-D to ascribe specific ancestry to genomic segments in modern human genomes, as well as in Neanderthal and Denisovan genomes, even identifying portions of the Denisovan genome that originated from an unknown, superarchaic human population. However, applying ARGWeaver-D is very computationally expensive, and the complexity of the demographic models that can be considered is limited. ARGWeaver-D can also be applied only to tens of individuals at most.

Additional methods give insight into the ARG while taking advantage of simplifications that allow for scalability and computational efficiency. Relate (Speidel et al. 2019) presents an efficient method to produce genealogies for each site in the genome. Relate first constructs genealogical trees at each site, building on the HMM process described by Li and Stephens (2003). Coalescence times are then inferred in a separate process using an iterative Markov chain Monte Carlo algorithm. The process is designed to scale to over ten thousand individuals. Using a different set of simplifications based on the Li and Stephens model, tsinfer (Kelleher et al. 2019) leverages the tree sequence data structure, which efficiently records genealogical trees for each genomic site by exploiting the fact that genealogies at adjacent genomic sites are highly correlated. The tree-sequence encoding stores edges shared by adjacent, correlated trees just once, which allows efficient storage, enables fast calculation of many tree-sequence features, and scales to over a hundred thousand individuals. Although tree sequences provide a computationally efficient approximation of much of the information contained within the complete ARG, the model makes simplifying assumptions, including a single origin for each mutation (no recurrent or back mutations), and an assumption that variant age can be approximated by variant frequency. Tree sequence construction is also vulnerable to errors in phasing and sequencing, and requires high-quality phased data as input. Both Relate and tsinfer can be used to detect archaic admixture in large population genomic data sets by identifying sites in the genome with exceptionally long branch lengths or a long time to the most recent common ancestor (TMRCA) in the ARG (fig. 2d). Additional data from archaic genomes can allow for discovery and validation of archaic ancestry segments. These methods also have the potential to more explicitly incorporate heterochronous sampling as part of the process of constructing a complete population history. Forthcoming extensions of these methods allow for inclusion of ancient and archaic samples to improve estimation of genealogies and the timing of events (Speidel et al. 2021; Wohns et al. 2021).
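To make the edge-sharing idea and the TMRCA scan concrete, the following is a minimal sketch in plain Python. It is not the tsinfer, Relate, or tskit API: the `Edge` record, the node times, the toy edge table, and the 10,000-generation threshold are all assumptions chosen for illustration. Each edge carries the genomic interval over which it holds, so edges common to adjacent trees are stored once, and the scan reports intervals where two sampled haplotypes coalesce unusually deep in the past.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    left: float    # start of the genomic interval (inclusive)
    right: float   # end of the genomic interval (exclusive)
    parent: int
    child: int

# Hypothetical node times in generations; nodes 0-2 are sampled haplotypes.
NODE_TIME = {0: 0.0, 1: 0.0, 2: 0.0, 3: 1_000.0, 4: 2_500.0, 5: 30_000.0}

# Toy edge table describing two adjacent trees. The edges 3->0 and 3->1 span
# both trees and are stored only once.
EDGES = [
    Edge(0, 100, 3, 0),
    Edge(0, 100, 3, 1),
    Edge(0, 60, 4, 2),
    Edge(0, 60, 4, 3),
    Edge(60, 100, 5, 2),
    Edge(60, 100, 5, 3),
]

def breakpoints(edges):
    """Sorted positions at which the marginal tree can change."""
    return sorted({e.left for e in edges} | {e.right for e in edges})

def parent_map(edges, pos):
    """Child -> parent mapping of the marginal tree covering position pos."""
    return {e.child: e.parent for e in edges if e.left <= pos < e.right}

def tmrca(parents, a, b):
    """Time of the most recent common ancestor of samples a and b."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = parents.get(node)
    node = b
    while node not in ancestors:
        node = parents[node]
    return NODE_TIME[node]

def deep_intervals(edges, a, b, threshold):
    """Intervals where the TMRCA of a and b exceeds threshold."""
    pts = breakpoints(edges)
    found = []
    for left, right in zip(pts, pts[1:]):
        t = tmrca(parent_map(edges, left), a, b)
        if t > threshold:
            found.append((left, right, t))
    return found

candidates = deep_intervals(EDGES, 0, 2, threshold=10_000.0)
```

In this toy table, haplotypes 0 and 2 share an ancestor at 2,500 generations over positions 0 to 60 and at 30,000 generations over positions 60 to 100, so only the second interval is flagged. A real analysis would choose the threshold from the genome-wide TMRCA distribution or from simulations rather than fixing it by hand.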

Machine learning methods and ABC have become prominent features in recent publications on the detection of archaic admixture. Both methods take advantage of fast, efficient software for population genetic simulation (Kelleher et al. 2016; Haller et al. 2019) to simulate data under model parameters sampled from a prior distribution (fig. 2e). Both machine learning and ABC use the information from simulations to infer population genetic parameters that fit the genomic data. ABC is a likelihood-free inference method that uses summary statistics as input. Summary statistics from genomic and simulated data are compared to find the combination of simulated parameters that yields summary statistics closest to those of the genomic data. If the distance between the genomic and simulated summary statistics is below a predefined tolerance, the model parameters are accepted. Otherwise, they are rejected, but the closest model parameters are used to guide additional rounds of simulation (Villanea et al. 2020). However, ABC is usually based on hand-selected summary statistics, and often requires a substantial investment in computational resources (usually >10⁶ individual simulations) to perform accurate parameter inference (Beaumont et al. 2002; Sousa et al. 2009).
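The accept/reject step described above can be sketched as a plain rejection sampler. This is a toy illustration under stated assumptions, not the procedure of any cited study: the simulator, the uniform prior on an admixture proportion, the observed value of 0.04, and the tolerance are all hypothetical, and the sequential refinement over additional rounds of simulation is omitted.

```python
import random

def simulate_summary(admix_prop, n_loci, rng):
    """Toy simulator: fraction of loci carrying an archaic-like haplotype.

    A stand-in for a full population genetic simulation (assumption).
    """
    hits = sum(rng.random() < admix_prop for _ in range(n_loci))
    return hits / n_loci

def abc_rejection(observed, n_loci, n_sims, tolerance, rng):
    """Keep prior draws whose simulated summary lies within tolerance."""
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(0.0, 0.2)              # draw from the prior
        simulated = simulate_summary(theta, n_loci, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(theta)                 # accept this parameter
    return accepted

rng = random.Random(1)
observed = 0.04    # hypothetical summary statistic from the genomic data
posterior = abc_rejection(observed, n_loci=500, n_sims=5000,
                          tolerance=0.005, rng=rng)
posterior_mean = sum(posterior) / len(posterior)
```

The accepted draws approximate the posterior distribution of the admixture proportion. Tightening the tolerance improves the approximation but lowers the acceptance rate, which is one reason ABC can require so many simulations.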

Supervised machine learning also takes advantage of fast, efficient methods for population genetic simulation to find the population parameters that produce simulated data most similar to the observed data set (fig. 2e). In supervised learning, simulated data are partitioned into training and test sets, and a variety of learning algorithms are used to classify the data and make inferences (for a comparison of different learning algorithms, see Caruana et al. 2008). Supervised learning can be applied in a genome scan approach to localize archaic introgression (Gower et al. 2021). ArchIE (Durvasula and Sankararaman 2019) uses logistic regression on a preselected set of summary statistics to distinguish haplotypes derived from anatomically modern humans (AMH) from those that derive from archaic human populations. Similarly, FILET (Schrider et al. 2018) uses an Extra-Trees classifier on a preselected set of summary statistics to identify introgressed loci (fig. 2e). Inference directly from sequence data, without the need for summary statistics, may also be possible in future work (Chan et al. 2018; Flagel et al. 2019). Supervised learning can also be applied to perform demographic model selection. Villanea and Schraiber (2019) used deep learning to match summaries of observed data with models that consider either single or multiple archaic admixture events. Supervised machine learning often requires a fraction of the simulations used in ABC. However, supervised machine learning does not usually provide inference of meaningful posterior probabilities.
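As a rough illustration of the simulate, split, train, and evaluate workflow, the sketch below fits a logistic regression by gradient descent on two summary statistics. It is not the ArchIE or FILET implementation: the Gaussian "simulator", the two unnamed statistics, the class means, and the learning settings are all assumptions made to keep the example self-contained.

```python
import math
import random

def simulate_window(introgressed, rng):
    """Toy stand-in for simulating one genomic window (assumption).

    Returns two hypothetical summary statistics whose means differ by class.
    """
    if introgressed:
        return [rng.gauss(3.0, 1.0), rng.gauss(-1.5, 1.0)]
    return [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]

def make_dataset(n, rng):
    data = []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        data.append((simulate_window(label == 1, rng), label))
    return data

def center(data, means):
    """Subtract the training-set mean from every feature."""
    return [([x - m for x, m in zip(feats, means)], y) for feats, y in data]

def sigmoid(z):
    z = max(-30.0, min(30.0, z))    # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(train, epochs=200, lr=0.5):
    """Batch gradient descent on the logistic loss."""
    n_feat = len(train[0][0])
    w, b = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for feats, y in train:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, feats)) + b) - y
            for j in range(n_feat):
                grad_w[j] += err * feats[j]
            grad_b += err
        w = [wi - lr * g / len(train) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(train)
    return w, b

def predict(w, b, feats):
    score = sum(wi * xi for wi, xi in zip(w, feats)) + b
    return 1 if sigmoid(score) >= 0.5 else 0

rng = random.Random(42)
simulated = make_dataset(1000, rng)
train, test = simulated[:700], simulated[700:]     # training / test split
means = [sum(f[j] for f, _ in train) / len(train) for j in range(2)]
train_c, test_c = center(train, means), center(test, means)
w, b = fit_logistic(train_c)
accuracy = sum(predict(w, b, f) == y for f, y in test_c) / len(test_c)
```

The held-out test set estimates how well the classifier would label windows it has never seen. In practice the trained model would then be applied to summary statistics computed from real genomic windows, and its output is a class score rather than a calibrated posterior probability.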

To ease these shortcomings, the advantages of both ABC and supervised machine learning have been combined in recent work (ABC-DL, Lorente-Galdos et al. 2019; Mondal et al. 2019). The goal of this approach was to reduce the volume of simulations required by letting supervised learning produce refined summary statistics that maximize the information extracted from fewer simulations, and then to use these refined summary statistics in ABC to infer posterior distributions. This approach also negates a major weakness of supervised machine learning, as it allows for the quantification of uncertainty through the inference of posterior probability distributions.
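The two-stage idea can be sketched as follows. This is only a schematic of the combination, not the ABC-DL implementation: a linear regression fitted by least squares stands in for the deep network, and the simulator, prior, observed statistics, and tolerance are hypothetical. Stage one learns a single refined statistic from raw summary statistics. Stage two runs rejection ABC on that learned statistic.

```python
import random

def simulate_raw_stats(theta, rng, n_loci=200):
    """Toy simulator returning two noisy raw summary statistics (assumption)."""
    frac = sum(rng.random() < theta for _ in range(n_loci)) / n_loci
    noisy = theta + rng.gauss(0.0, 0.02)
    return [frac, noisy]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_learned_summary(n_train, rng):
    """Stage 1: regress theta on raw statistics (stand-in for a neural net)."""
    rows, targets = [], []
    for _ in range(n_train):
        theta = rng.uniform(0.0, 0.2)                  # prior draw
        rows.append([1.0] + simulate_raw_stats(theta, rng))
        targets.append(theta)
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    coef = solve(xtx, xty)
    return lambda stats: coef[0] + sum(c * s for c, s in zip(coef[1:], stats))

def abc_with_learned_summary(observed_stats, summary, n_sims, tolerance, rng):
    """Stage 2: rejection ABC using the learned statistic as the distance."""
    target = summary(observed_stats)
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(0.0, 0.2)
        if abs(summary(simulate_raw_stats(theta, rng)) - target) <= tolerance:
            accepted.append(theta)
    return accepted

rng = random.Random(7)
summary = fit_learned_summary(2000, rng)
observed_stats = [0.05, 0.05]     # hypothetical statistics from genomic data
posterior = abc_with_learned_summary(observed_stats, summary, 3000, 0.005, rng)
posterior_mean = sum(posterior) / len(posterior)
```

Because the accepted draws form a sample from an approximate posterior, their spread gives the measure of uncertainty that a classifier or regressor alone would not provide.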

The advances represented by these methods have revolutionized our understanding of archaic admixture at a rapid pace (for a summary, see table 1, fig. 2). In the near future, we predict significant expansions of both data sources and methods, which will open new lines of inquiry and give new insight into the legacy of archaic admixture (see Conclusions for further discussion). Next, we focus on two areas where novel results have already changed our biological understanding in the last decade: demographic models of human ancestry in the African continent, and the effects of functional regions influenced by archaic alleles.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
