# Conversion of simulations to genotype matrices

This protocol is extracted from the research article:
Detecting adaptive introgression in human evolution using convolutional neural networks. eLife, May 25, 2021.

## Procedure

We converted the tree sequence files from the simulations into genotype matrices using the tskit Python API (Kelleher et al., 2016). Major alleles (those with sample frequency greater than 0.5 after merging all individuals) were encoded in the matrix as 0, while minor alleles were encoded as 1. In the event of equal counts for both alleles, the major allele was chosen at random. Only sites with a minor allele frequency >5% were retained. For sweep and AI simulations, we excluded the site of the selected mutation.
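The encoding step above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a 0/1 haplotype matrix of shape (sites × haplotypes) such as the one returned by tskit's `TreeSequence.genotype_matrix()`, and the function name `encode_major_minor` is our own.

```python
import numpy as np

def encode_major_minor(G, maf_threshold=0.05, rng=None):
    """Recode a (sites x haplotypes) 0/1 matrix so that the major allele
    (sample frequency > 0.5) is 0 and the minor allele is 1, then keep only
    sites with minor allele frequency above maf_threshold.

    Ties (equal allele counts) are broken at random, as in the protocol.
    """
    rng = rng or np.random.default_rng()
    G = np.asarray(G)
    n_haps = G.shape[1]
    freq1 = G.sum(axis=1) / n_haps            # frequency of allele "1" per site
    flip = freq1 > 0.5                        # allele "1" is major: flip encoding
    ties = freq1 == 0.5
    flip = flip | (ties & (rng.random(G.shape[0]) < 0.5))  # random tie-break
    G = np.where(flip[:, None], 1 - G, G)
    maf = G.sum(axis=1) / n_haps              # after recoding, allele "1" is minor
    return G[maf > maf_threshold]
```

Excluding the selected site in sweep and AI simulations would be an additional masking step on the site positions before this encoding.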

We note that different simulations result in different numbers of segregating sites, but a constraint for efficient CNN training is that each datum in a batch must have the same dimensions. Existing approaches to solve this problem are to use only a fixed number of segregating sites (Chan et al., 2018), to pad the matrix out to the maximum number of observed segregating sites (Flagel et al., 2019), or to use an image-resize function to constrain the size of the input data (Torada et al., 2019). Each approach discards spatial information about the local density of segregating sites, although this may be recovered by including an additional vector of inter-site distances as input to the network (Flagel et al., 2019).

To obtain the benefits of image resizing (fast training times for reduced sizes and easy application to genomic windows of a fixed size), while avoiding its drawbacks, we chose to resize our input matrices differently, and only along the dimension corresponding to sites. To resize the genomic window to have length $m$, the window was partitioned into $m$ bins, and for each individual haplotype we counted the number of minor alleles observed per bin. Compared with interpolation-based resizing (Torada et al., 2019), binning is qualitatively similar, but preserves inter-allele distances and thus the local density of segregating sites. Furthermore, as we do not resize along the dimension corresponding to individuals, this also permits the use of permutation-invariant networks (Chan et al., 2018), although we do not pursue that network architecture here.
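The binning procedure can be expressed compactly with NumPy. The sketch below is our own illustration under stated assumptions: the input is a (sites × haplotypes) minor-allele matrix together with the physical position of each site, and the window spans `[0, seq_length)`.

```python
import numpy as np

def resize_by_binning(G, positions, seq_length, m=256):
    """Resize a (sites x haplotypes) minor-allele matrix to (m x haplotypes)
    by partitioning the window into m equal-width bins and counting minor
    alleles per bin for each haplotype.
    """
    G = np.asarray(G)
    # Map each site's position to a bin index in [0, m-1].
    bins = np.minimum((np.asarray(positions) / seq_length * m).astype(int), m - 1)
    out = np.zeros((m, G.shape[1]), dtype=G.dtype)
    np.add.at(out, bins, G)   # accumulate allele counts into their bins
    return out
```

Because each allele is counted in the bin containing its physical position, inter-allele distances (and hence the local density of segregating sites) survive the resize, unlike interpolation-based image resizing.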

We report results for $m=256$, but also tried $m=32$, 64, and 128 bins. Preliminary results indicated greater training and validation accuracy for CNNs trained with more bins: accuracy improved by roughly 1% from 32 to 64 bins, and again from 64 to 128 bins, but only marginally from 128 to 256. For unphased data, we combined genotypes by summing minor-allele counts across the two chromosomes of each individual. We note that all data were treated as either fully phased or fully unphased; no mixed phasing was considered.
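Collapsing phased haplotypes to unphased genotypes amounts to summing column pairs. A minimal sketch, assuming consecutive columns 2i and 2i+1 belong to individual i (tskit's default sample ordering for diploids):

```python
import numpy as np

def to_unphased(H):
    """Combine a (bins x haplotypes) minor-allele count matrix into a
    (bins x individuals) matrix by summing the two chromosomes of each
    diploid individual.
    """
    H = np.asarray(H)
    assert H.shape[1] % 2 == 0, "expected an even number of haplotype columns"
    return H[:, 0::2] + H[:, 1::2]
```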

We then partitioned the resized genotype matrix into submatrices by population. Submatrices were ordered left-to-right according to the donor, recipient, and unadmixed populations respectively. For genotype matrices including both Neanderthals and Denisovans, we placed the non-donor archaic population to the left of the donor. To ensure that a non-permutation-invariant CNN could learn the structure in our data, we sorted the haplotypes (Flagel et al., 2019; Torada et al., 2019). The resized haplotypes/individuals within each submatrix were ordered left-to-right by decreasing similarity to the donor population, calculated as the Euclidean distance to the average minor-allele density of the donor population (analogous to a vector of the donor allele frequencies). An example (phased) genotype matrix image for an AI simulation is shown in Figure 1.
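The haplotype ordering within a submatrix can be sketched as below. This is our own illustration: columns are haplotypes (or individuals), rows are bins, and the function name is hypothetical.

```python
import numpy as np

def sort_by_donor_similarity(sub, donor):
    """Order the columns of a (bins x haplotypes) submatrix left-to-right by
    decreasing similarity to the donor population, i.e. by increasing
    Euclidean distance to the donor's mean minor-allele density per bin
    (analogous to a vector of donor allele frequencies).
    """
    target = np.asarray(donor).mean(axis=1)                  # mean density per bin
    dist = np.linalg.norm(np.asarray(sub) - target[:, None], axis=0)
    return sub[:, np.argsort(dist)]                          # most donor-like first
```

In the protocol, this ordering is applied within each population's submatrix, using the same donor-derived target vector throughout, so that donor-like haplotypes cluster at a consistent edge of the image.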
