Analysis of Empirical Data

Paolo Momigliano; Ann-Britt Florin; Juha Merilä

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Analysis of Empirical Data

PM Paolo Momigliano

AF Ann-Britt Florin

JM Juha Merilä

This method is extracted from research article: Mol Biol Evol, Feb 2021

Biases in Demographic Modeling Affect Our Understanding of Recent Divergence

DOI: 10.1093/molbev/msab047

Request a Protocol

Ask a question

Favorite

We obtained a total of 172 samples of S. maximus from seven locations and three biogeographic regions (supplementary table 2, Supplementary Material online and fig. 5A): the North Sea (one location, N = 20), the transition zone separating the North Sea from the Baltic Sea (two sampling locations, Vendelsö: N = 27 and Öresund: N = 35), and the Baltic Sea (four populations, Dabki: N = 24, Gdynia: N = 23, Gotland: N = 20 and Gotska Sandön: N = 24). Individuals from the transition zone and the Baltic Sea are a subsample of the individuals analyzed by Florin and Höglund (2007), whereas samples from the North Sea were originally collected by Nielsen et al. (2004).

We built 2b-RAD libraries following the approach described by Wang et al. (2012), but with degenerate adaptors to allow identification of PCR duplicates. The protocol is described in detail by Momigliano et al. (2018). In short, DNA was extracted using a modified salting out protocol, and about 200 ng of DNA was digested with the type II b enzyme BcgI (New England Biolabs). This enzyme cuts both upstream and downstream of the 6-bp recognition site, creating fragments of a length of exactly 36 bp with 2-bp overhangs. Adaptors, one of which included degenerate bases, were ligated and the fragments amplified via ten cycles of PCR as described in Momigliano et al. (2018). Fragments of the expected size were isolated using a BluePippin machine (Sage Science). Libraries were sequenced on Illumina machines (NextSeq500 and Hiseq 4,000) to achieve a mean coverage of approximately 20×.

Raw reads were demultiplexed and PCR duplicates were removed as per Momigliano et al. (2018). Reads were then mapped to the latest version of S. maximus reference genome (Figueras et al. 2016, Assembly ASM318616v1, GenBank accession number: GCA003186165) using Bowtie2 (Langmead and Salzberg 2012). SAM files were converted to BAM files and indexed using SAMTOOLS (Li et al. 2009). A genotype likelihood file in beagle format was produced in the software ANGSD (Korneliussen et al. 2014), using the following filters: no more than 20% missing data, retaining only biallelic loci, removing bases with mapping quality below 30 and Phred quality scores below 20. See supplementary figure 1, Supplementary Material online, for summary statistics. A principal component analysis (PCA) based on genotype likelihoods was performed using the software PCAngsd (Meisner and Albrechtsen 2018) using only variants with a minor allele frequency above 0.02. The folded AFS for each population as well as the jAFS were produced in ANGSD. ANGSD calculates folded jAFS where the minor allele is computed separately for each AFS, whereas moments expects minor alleles to be estimate for the jAFS. Thus, for the jAFS for North Sea and Baltic Sea, we produced the unfolded jAFS in ANGSD in the form of a dadi data dictionary. We then selected a random SNP within each locus and folded the jAFS in moments.

We produced also a variant call file (VCF) using the UnifiedGenotyper function from GATK v.3.8. Following UnifiedGenotyper, we removed individuals with an average read depth below seven. Then, we used four technical replicate pairs (i.e., four pairs of individuals for which we constructed and sequenced two independent libraries) to generate a list of SNPs for which we have high confidence (i.e., which show 100% matches between all replicate pairs). We used this list of SNPs to carry out variant quality score recalibration (VQSR), following GATK best practice (Dixon et al. 2015). Finally, we removed genotype calls with low sequencing depth (<7), removed indels, triallelic SNPs, SNPs with minor allele frequencies below 0.01 and with more than 10% missing data. This resulted in a final VCF containing 12,678 sites genotyped for 154 individuals. This data set was used to calculate Weir and Cockerham F_ST for each SNP. Furthermore, we thinned the data retaining, for each tag, the SNP with the highest minor allele frequency and used this data for inferring population structure using fastSTRUCTURE (Raj et al. 2014). Bioinformatic steps, scripts for analyses, and the jAFS used are publicly available (see Data Availability).

We used several approaches to identify potential islands of differentiation across the genome between North Sea and Baltic Sea turbots. Firstly, we calculated from the called genotypes F_ST between the North Sea and Baltic (excluding individuals from the transition zone) using VCFTOOLS (Danecek et al. 2011). This is the only measure of differentiation we calculated from called genotypes. Secondly, we used the software package PCAngsd to run a selection scan using an extended model of fastPCA (Galinsky et al. 2016) working directly on genotype likelihoods, based on the input beagle file we used for the PCA. This approach identifies variants whose differentiation along a specific principal component (in our case the first PC) is greater than the null distribution under genetic drift. To account for multiple comparisons, we converted P values to q-values (false discovery rate, FDR) following Benjamini and Hochberg (1995). As a second approach to classify SNPs as outliers, we used a Hidden Markov model (HMM) approach to detect genomic islands, based on the uncorrected P values from PCAngsd selection scan (Hofer et al. 2012; Marques et al. 2016). We counted as candidate outliers SNPs that show and FDR <0.1 and that simultaneously were identified as outliers by the HMM test. If adjacent SNPs identified by both approaches lied within a distance of <500 kb, we identified them as part of the same genomic island of differentiation.

We then obtained estimates of within population genetic diversity (π) and absolute divergence (d_xy) across the genome. In order to maximize the usable data and account for differences in coverage among samples, we performed all analyses in ANGSD directly from genotype likelihoods. Firstly, we generated windows of 250 kb across the genome using BEDTools (Quinlan and Hall 2010). Then, we calculated the unfolded AFS (for the North Sea and Baltic Sea individuals) as well as the jAFS for each 250-kb window across the genome with ANGSD, using only filters that do not distort the AFS (-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 20 -C 50). We finally used custom R scripts to calculate π and d_xy for each window, retaining only windows for which the AFS was derived from at least 1,000 sequenced bases. All scripts to calculate the AFS in windows and derive summary statistics are available from GitHub and Zenodo (sea Data Availability). Note that, for diversity analyses, we used the unfolded AFS even if we do not have an appropriate outgroup, assuming the reference allele as the ancestral state. However, this is not an issue since the summary statistics calculated are based on allele frequencies, which are symmetrical with respect to folding.

The demographic history of the North Sea and Baltic Sea populations of S. maximus were reconstructed using several approaches based on the 1d-AFS (for one-population models) and the jAFS (for two-population models). Since taking into account possible changes in effective population size may have important effects on the estimation of parameters such as migration rate and divergence times (Gravel et al. 2011), we first determined the demographic history of each population independently using the 1d-AFS, using both moments and stairway plots (Liu and Fu 2015). We then proceeded to compare 32 two-population models to determine the demographic history from the jAFS. We used a µ of $1 \times 10^{- 8}$ and a generation time of 3.5 years for scaling demographic events to make results directly comparable to Le Moan et al. (2019).

Firstly, we estimated past demographic changes in the North Sea populations and from the Baltic Sea populations (i.e., excluding samples from the transition zone) using the multi-epoch model implemented in the software Stairway Plot v2 (Liu and Fu 2015). Stairway plots use composite likelihood estimations of Θ at different epochs, which is then scaled using the mutation rate. For Baltic Sea populations, we estimated past demographic changes from the 1d-SFS from each sampling location independently. Stairway plots were generated including singletons, using 2/3 of the sites for training and four numbers of random break points for each try ( $\frac{1}{4}, \frac{1}{2}, \frac{2}{3}$ , and 1 time the number of samples in each population). Since demographic histories were similar in all locations, and there was no evidence of population structure from other analyses, all subsequent analyses were performed using the jAFS estimated from pooling all samples from populations in the Baltic Sea.

Secondly, we compared three simple one-population models using moments: a standard neutral model (SNM, assuming constant population size at equilibrium), a two-epoch model (2EP, assuming a sudden population change at time T1), and a three-epoch model including a sudden demographic change at time T_AE followed by a bottleneck at time T_B followed by exponential growth (3EP $_{B}$ ). The 2EP model represents a single demographic change and a scenario where a demographic expansion/contraction occurred either in the ancestral population from which the North Seaand Baltic populations are derived or in the North Sea and Baltic Sea populations themselves. The 3EP $_{B}$ model represents a scenario where, in addition to an ancestral expansion/contraction, there was a recent bottleneck followed by growth; this could be a realistic scenario for the Baltic Sea populations, which must have invaded the Baltic Sea following its connection to the North Sea within the past 8 kyr.

Given the existence of a well-known hybrid-zone between the Baltic Sea and the North Sea (Nielsen et al. 2004), we tested the two main divergence scenarios that include contemporary migration: the isolation with migration model (IM), and the secondary contact (SC) models. All models consist of an ancestral population of size N_ANC that splits into two populations of size N₁ and N₂ at time T_S. Migration is continuous and asymmetric in the IM model. The SC model includes a period of isolation starting at time T_S and a period of secondary contact when asymmetric migration starts at time T_SC. For each of these basic models, we tested models that included heterogeneous migration rates across the genome (2M, i.e., islands resisting migration), and heterogeneous N_e across the genome (2N, a way to model linked selection) as described in Rougeux et al. (2017). We therefore had four possible variations of each of the basic model (e.g., SC, SC $_{2 M}$ , SC $_{2 N}$ , SC $_{2 M 2 N}$ ), for a total of eight basic divergence models.

Both stairway plots and moments analyses of the empirical data suggest (see Results) a demographic expansion 20–100 kya, and that the Baltic Sea population may also have undergone a recent bottleneck. If such demographic changes had happened at the time of split between the two populations, this would be well captured by the eight models described above which take into account a single change in N_e from N_ANC to N₁ and N₂ at time T_S. However, given that the timing and magnitude of the population expansion are very similar in all populations, another possibility is that the ancestral population underwent a demographic expansion prior to the split between the North Sea and the Baltic Sea. We incorporated this hypothesis by extending the eight models described above to include an ancestral population expansion (AE, ancestral expansion models), a recent bottleneck followed by population growth in the Baltic Sea (B, bottleneck models) or both (AEB models). In the AE models, the ancestral population undergoes a demographic change at time T_AE, after which population size remains constant until time of split (T_S). In the B models, the Baltic Sea population undergoes a bottleneck followed by population growth at time T_S. This scenario mimics a possible invasion of the Baltic Sea from a small founder population. For both IM and SC models, all possible combinations of heterogeneous N_e (2N models), heterogeneous migration rates (2M models), ancestral expansion (AE models), and bottlenecks (B models) were tested, yielding 16 variations of the IM and SC models and a total of 32 models tested. It should be note that in 2N2M models, it is assumed that regions experiencing lower migration rates and regions experiencing lower effective population size do not overlap. This made convergence of the complex models easier, and it should not be problematic assuming the proportion of the genome experiencing lower migration rates and with lower N_e are relatively small (as is our case, see supplementary table 3, Supplementary Material online).

With the exception of the SNM models, which has no free parameters, all one- and two-population models were optimized in five independent optimization routines, each consisting of five rounds of optimization using an approach similar to Portik et al. (2017). In the first round, we generated 30 sets of three-fold randomly perturbed parameters and for each ran a BFGS optimization setting max-iter to 30. In the second round, we chose for each model the replicate with the highest likelihood from the first round and generated 20 sets of three-fold randomly perturbed parameters, followed by the same optimization strategy. We repeated the same procedure for rounds three, four, and five, but using two-fold (round three) and one-fold (round four and five) perturbed parameters, respectively. In the final round, we also estimated 95% confidence intervals (CI) of the estimated parameters using the Fisher Information Matrix, as described by Coffman et al. (2016), and 95% CI of T_S and T_SC were estimated according to the rules of propagation of uncertainty. We selected the best replicate for each of the final models (three one-population models and 32 two-population models). We then calculated W_AIC as a relative measure of support for each model. Parameters in coalescent units were scaled based on estimates of N_ANC as outlined in dadi’s manual. N_ANC was calculated as $Θ / (4 L μ)$ , where L is the total sequence length from which SNPs used in demographic analyses originated and $μ = 1 \times 10^{- 8}$ . The total sequence length was calculated as $S \times (V_{U} / V_{T})$ , where S is the number of sites (variant and invariant) retained in ANGSD to calculate the jAFS, V_U is the number of unlinked SNPs retained for demographic modeling and V_T is total number of variants in the jAFS before linked SNPs were removed. Time parameters were scaled assuming a generation time of 3.5 years to be directly comparable with a previous study on the same species (Le Moan et al. 2019).

The use of AIC to rank models and of the Fisher Information Matrix to estimate parameter uncertainty relies on the assumption that genetic data are independent. For RAD data, it is generally assumed that keeping one SNP per RAD locus is sufficient to satisfy this assumption. Nevertheless, we carried out further analyses to ensure that unaccounted linkage did not lead to 1) biased estimates of parameter uncertainty and 2) favoring more complex models. Firstly, we estimated parameter uncertainty for the two best models (that according to W_AIC had similar support) using block-bootstraps and the Godambe Information Matrix (GIM). Secondly, after the two best-fitting models had been identified (in this case, IM $_{AEB 2 N}$ and SC $_{AEB 2 N}$ ), we used LRT to test for the support of specific parameters among nested models, controlling for Type I error by normalizing the difference in log-likelihoods as outlined in Coffman et al. (2016). We used LRT to test support for an SC scenario by comparing IM $_{AEB 2 N}$ and SC $_{AEB 2 N}$ models. Since the test failed to reject the simpler IM scenario, we further used LRT to test for support for a past bottleneck in the Baltic Sea population (comparing IM $_{2 N}$ and IM $_{B 2 N}$ models) and for an ancestral expansion (comparing IM $_{2 N}$ and IM $_{AE 2 N}$ models). The correct weighting of χ² distributions for the LRT were calculated according to Ota et al. (2000).

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol