Timing. To obtain timings for our method and LDhat for a realistic use case, we computed the time it took to infer a recombination map for chromosome-scale data. Both methods make use of the same precomputed lookup table of two-locus likelihoods, and so we did not benchmark the creation of those tables, which has been done previously (23). Thus, we compare only the amount of time to infer a recombination map. Using msprime (59), we simulated 10 replicates of data matching the length of chromosome 1 with the HapMap recombination map (34) and under the demography inferred for CEU, for a sample size of n = 196 haploids. Because LDhat does not allow parallelization, we wrote a python script to separate the data into the same overlapping windows used in our method (windows of 4001 SNPs overlapping by 100 SNPs). We ran our method pyrho using 32 cores and also used 32 cores to parallelize LDhat runs. For the LDhat runs, we then used a python script to combine the output of the runs. Because our scripts for splitting and combining the data for LDhat are not optimized, we only timed the total runtime of LDhat and compared that to the total time pyrho required, which is slightly advantageous for LDhat. We used the “optimal” hyperparameters for pyrho as discussed below and used the default parameters for LDhat, which were tuned to a human-like setting. The timings are presented in fig. S6, showing that in our simulations pyrho was, on average, at least 10 times faster than LDhat. Yet, when generating the 1KG maps, LDhat was run on windows of 2000 SNPs, and the MCMC was run for 22.5 million iterations per window, whereas we used only 1 million iterations per window in our timing benchmark. Computing the recombination maps for chromosome 1 for 1KG thus likely took between 22.5 and 45 times longer than the results reported here, suggesting that our method is closer to between 225 and 450 times faster.

Accuracy on simulated data. To assess the accuracy of our method, we used msprime (59) to simulate 100 sequences of 1 Mb with recombination maps randomly drawn from the HapMap recombination map (34) under the demography inferred for CEU. We then used the lookup table generated for CEU, which takes demography into account, for pyrho, while using a constant-demography lookup table for LDhat, as is the default for that program. For each simulation, we took the middle 500 kb and computed the correlation between the true recombination map and the inferred recombination map. We computed the Pearson correlation in both natural and log scale, and also the Spearman correlation. To avoid issues with autocorrelation, we look at windows centered at every 10,000th position. To assess the correlation at different spatial scales, we considered windows of different sizes [1 base pair (bp), 1 kb, and 10 kb].

We also performed simulations as above but with smaller sample sizes, n ∈ {20,40,60,80,100,120} and per-generation mutation rates a factor of 10 times lower or higher than humans, μ ∈ {1.25 × 10−9,1.25 × 10−8,1.25 × 10−7}, to show the applicability of pyrho to other species and sample sizes. In all cases, we used the demography for CEU and scaled the fine-scale recombination rates such that the ratio of the mutation and recombination rates remained fixed across different mutation rates. We performed hyperparameter optimization as described below for each sample size and mutation rate combination and then ran pyrho. The results are presented in fig. S1C. Overall, we find that the method performs better for species with higher levels of diversity (i.e., those with a larger mutation rate) and for larger sample sizes, although with some diminishing returns.

We also investigated whether the differences between pyrho and LDhat are due to the optimization scheme (i.e., fused-LASSO versus MCMC) or due to the effect of taking demographic history into account. Using the same simulations as described above, we reran LDhat using the demography-aware lookup table used by pyrho and computed the same measures of correlation between these inferred maps and the true maps. We found that at fine scales pyrho outperforms LDhat by any measure regardless of whether LDhat used a constant-demography lookup table or the demography-aware lookup table. At broader scales, pyrho outperformed LDhat if LDhat used a constant-demography lookup table but performed comparably to LDhat using the demography-aware lookup table. Meanwhile, the demography-aware version of LDhat outperformed the version of LDhat that assumed a constant-demography at all scales. The results are summarized in table S1.

Comparison of r2 on the 1000 Genomes Project dataset. To get a sense of accuracy on real data, we computed a measure of linkage disequilibrium, r2, between pairs of nearby SNPs. We used vcftools (60) with the --hap-r2, −-ld-window 15, −-thin 2000,--maf 0.1, and --max-missing 1 flags. Briefly, this removes all missing data, removes SNPs until they are all separated by at least 2 kb, and removes SNPs with a minor allele frequency (MAF) less than 10% and then computes the r2 for all SNPs within 15 SNPs of each other. For each pair of SNPs, we then computed the recombination rate between them as determined by a given fine-scale recombination map. We sorted the pairs of SNPs by the recombination rate between them, grouped them into bins of 1000 pairs of SNPs, and reported the empirical deciles from that bin. We compared these against the theoretical deciles of the distribution of r2 for sample sizes matched to the observed sample sizes for SNPs with MAF greater than 10%, which we computed from the lookup tables we generated as discussed below. The results for YRI are presented in Fig. 1B, and the results for CEU and CHB are presented in fig. S2. See Table 1 for a list of three-letter population codes.

The super-populations are African (AFR), admixed American (AMR) East Asian (EAS), European (EUR), and South Asian (SAS).

To determine whether the differences between maps are statistically significant, we computed the mean square error between the empirical and theoretical deciles, averaged across all of the bins of pairs of SNPs for different maps. To compare two maps, we used the difference in their mean square error as a test statistic and obtained a null distribution by performing 1 million permutations of the bins (i.e., randomly assigning each bin to one map or the other, making sure that each map has the correct number of bins). We compared the maps that we inferred to the linkage disequilibrium-based maps, HapMap (34) and 1KG (15); a trio-based map, DECODE (28); and an admixture-based map (26). We performed this comparison for CEU, CHB, and YRI using the appropriate population for our population-specific maps and the population-specific maps of 1KG. That is, overall, we performed 12 comparisons (comparing our maps against four others in three different populations). For each comparison, we found that our recombination maps had a lower mean square error between the empirical and theoretical deciles with no permutations providing an equal or greater improvement, conservatively implying P < 1 × 10−5 for each comparison.

Note: The content above has been extracted from a research article, so it may not display correctly.

Please log in to submit your questions online.
Your question will be posted on the Bio-101 website. We will send your questions to the authors of this protocol and Bio-protocol community members who are experienced with this method. you will be informed using the email address associated with your Bio-protocol account.

We use cookies on this site to enhance your user experience. By using our website, you are agreeing to allow the storage of cookies on your computer.