GBS library preparation and population genetic analysis

David I. Dayan; Xiao Du; Tara Z. Baris; Dominique N. Wagner; Douglas L. Crawford; Marjorie F. Oleksiak

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

GBS library preparation and population genetic analysis

DD David I. Dayan

XD Xiao Du

TB Tara Z. Baris

DW Dominique N. Wagner

DC Douglas L. Crawford

MO Marjorie F. Oleksiak

This method is extracted from research article: BMC Evol Biol, Feb 2019

Population genomics of rapid evolution in natural populations: polygenic selection in response to power station thermal effluents

DOI: 10.1186/s12862-019-1392-5

Request a Protocol

Ask a question

Favorite

Fin clips from the terminal margin of the caudal fin approximately 5 mm² in size were taken from individuals in the field using scissors and stored in 270 ul of Chaos buffer (4.5 M guanadinium thiocynate, 2% N-lauroylsarcosine, 50 mM EDTA, 25 mM Tris-HCl pH 7.5, 0.2% antifoam, 0.1 M β-mercaptoethanol); these samples were stored at 4 °C prior to processing. Genomic DNA was isolated from fin clips using an Epoch Life Sciences silica column [95]. DNA quality was assessed via gel electrophoresis and concentrations were quantified in triplicate using Biotium AccuBlueTM Broad Range dsDNA Quantitative Solution according to manufacturer’s instructions. 100 ng of DNA from each sample was dried down in 96-well plates. Samples were then hydrated overnight with 5 ul of water before restriction enzyme digestion and further processing.

Genomic DNA (gDNA) was isolated from 296 individuals. These gDNA samples were individually barcoded and used to create a reduced representation library for genotyping by sequencing (GBS) [96]. The library was created in duplicate with barcode assignment of individuals randomized across both replicate libraries. Each replicate library was sequenced on 1 Illumina Hi-Seq2500 lane. The combined raw dataset consisted of 274,102,532 single-end, 75 bp reads. The reference genome-based GBS analysis pipeline, TASSEL [97] was used to call SNPs using the Fundulus heteroclitus genome [46]; SNPs were identified using the “Discovery Build.” In brief, TASSEL collapses identical, individually barcoded, short sequence reads, aligns these reads to a reference genome and calls genotypes on the basis of allelic redundancy using up to 127 reads per unique sequence tag per individual. Reads beyond this depth are ignored. Heterozygotes are called using a binomial-likelihood approach [98]. A log of console input for the pipeline is available upon request. We largely used default settings throughout the pipeline with the following exceptions: a minimum of 5 counts were required for retention of unique tags during the merge multiple tag count fork, and tag alignment to the reference genome was accomplished with bowtie2 using the very-sensitive-local setting.

We found 1,451,801 unique sequence tags that contained both the barcode and AseI cut site. Bowtie aligned 1,142,340 (78.7%) of these tags to unique loci in the F. heteroclitus genome [46]; 159,591 (11.0%) sequence tags aligned to multiple loci, and 149,870 (10.3%) had no alignment. The latter two tag sets were excluded from further analysis. Heterozygotes were called using a binomial likelihood ratio based approach of quantitative genotype calling, as implemented in the TASSEL-GBS discovery pipeline [98]. Among the 1.1 million tags that singly aligned to the F. heteroclitus genome, we identified 314,746 SNPs.

We filtered the 314,746 SNPs identified by the TASSEL discovery pipeline among all 296 individuals in the library. We removed fifty-seven individuals missing more than 12.5% of SNPs. To remove polymorphisms that may have arisen from sequencing and amplification errors or alignment across paralogs (versus polymorphisms between alleles), we then filtered the SNP dataset by coverage, minor allele frequency and whether observed heterozygosity (H_o) was significantly greater than the expected heterozygosity (H_e) [99, 100]. Retaining SNPs that were called in at least 85% of the remaining 239 individuals, resulted in 5907 SNPs. Of the 5907 SNPs in 239 individuals, 110 with minor allele frequencies less than 1% were removed. Hardy-Weinberg equilibrium was calculated for individual loci using Arlequin v3.5.1.2 [101] using 1000,000 steps in the Markov chain with 100,000 dememorization steps. Then, 348 SNPs with H_o > H_e that exceeded Hardy-Weinberg equilibrium at p < 0.01 were removed. Thus, the fully filtered SNP dataset consisted of 5449 SNPs among 239 individuals. Mean read depth per SNP per individual was 26.29 ± 0.43 for the full SNP dataset (Additional file 1: Figure S1 and Additional file 11: Figure S6). Most (64.3%) SNPs in the final full SNP dataset have at least 10 reads in all individuals. The TASSEL-GBS pipeline caps the number of reads used to make a call at 127 for each allele. Therefore, the range of read depth per individual per SNP in the dataset was 0–254 (2*127), and the high frequency of 127 and 254 counts per SNP per individual in the read depth frequency distribution results from highly sequenced individual–by-SNP combinations.

Outlier scans were conducted with FDIST2 [102] as implemented in LOSITAN [53]. For all pairwise population comparisons we used the same settings. We culled loci that are potential outliers to more narrowly estimate initial mean F_ST values (neutral mean F_ST option) and used the bisection approximation algorithm to estimate mean F_ST values (force mean F_ST option). After these steps, we conducted 50 k simulations. To control for multiple comparisons, we adjusted empirical p-values provided by LOSITAN with a modified FDR of 5% [54, 55]. These results were also used to construct putatively neutral panels of SNPs in each triad for population genetic inference. This dataset includes SNPs that are not outliers due to either putative directional or balancing selection (i.e. 0.05 < p-simulated < 0.95) in any of the three pairwise comparisons within a triad.

Population genetic parameters were calculated in a variety of statistical packages. Isolation by distance was tested using a Mantel test (9999 simulations) in the ade4 R package [103]. Arlequin v3.5.1.2 was used to calculate the proportion of inter-population pairwise differences, intra-population estimates of nucleotide diversity (π), and AMOVA (99,999 permutations). Linkage disequilibrium (r²) and sliding window nucleotide diversity was calculated in the tassel GUI. Smoothing conditional mean r² and F_ST along the genome was accomplished using LOESS. The span for each fit was chosen using the bias-corrected Akaike information criterion [104].

Population genetic structure inference was made using STRUCTURE [105] and DAPC [56]. STRUCTURE analyses were performed on five subsets of the data: (a) the complete SNP dataset within each triad (effluent population + two reference populations), (b) a LD thinned dataset within each triad consisting of 3992 SNPs created by randomly removing one SNP from any pair with r² > 0.5 (c) a “neutral” SNP dataset with both potential directional and balancing selection outlier loci (p–value: 0.05) excluded, (d) a SNP dataset consisting of all pairwise directional selection outliers within a triad, i.e. the most differentiated loci among the population within a triad and (e) a minor allele frequency matched neutral dataset. For (e), we bootstrap sampled the “neutral” dataset for each triad, so that the allele frequency distribution matched that of the dataset composed of the most differentiated loci. For all analyses, we used a burn in of 10 k steps and MCMC of 20 k steps with at least 7 replicates for each k value and varied k from 1 to 6. In all replicates, burn in was sufficient for convergence in the STRUCTURE parameters: α, F, D and likelihood. We specified an admixture model with correlated allele frequency to improve clustering among closely related populations within a triad [105]. Replicate individual k runs were merged and the ΔK method for identifying the optimal number of clusters [106] was calculated using CLUMPAK [107]. For Oyster Creek, likelihood scores (Pr(X|K)) beyond K = 2 increase at a decreasing rate (Additional file 4: Figure S3a), and the ΔK method identifies K = 2 as the optimal number of population clusters (Additional file 4: Figure S3c). For Brayton Pt, although the ΔK method identifies K = 4 as the optimal number of population clusters, likelihood scores (Pr(X|K)) plateau at K = 3 (Additional file 4: Figure S3b and d), and results at K higher than 3 are qualitatively similar. The first genetic split (K = 2) separated the southern reference population (Matunuck, RI) from both the TE and the northern reference populations (Horseneck Beach, MA).

For DAPC, we used the full SNP dataset. For the Brayton Point triad, we identified 2 as the optimal number of principal components (PCs) (Additional file 4: Figure S3 h), although using up to 40 PCs produced qualitatively similar results. DAPC perfectly matched inferred genetic clusters with the three populations using these 2 PCs. Furthermore, the Bayesian information criterion (BIC) of K-means clustering from K = 1–12 demonstrated an inflection point at K = 3 (Additional file 4: Figure S3f). These results remained consistent when up to 40 PCs were used.

For the Oyster Creek triad, the a-score was low from 1 to 90 retained PCs; i.e., individuals from single populations did not reliably fall into single genetic clusters (Additional file 4: Figure S3 g). Similarly, the BIC for successive K-means clustering suggested 1 as the appropriate number of genetic clusters to describe the data (Additional file 4: Figure S3e), despite the extent of population differences indicated by pairwise genome-wide F_ST value estimates (Table (Table1)1) and significant among population differences in the AMOVA (Table 2). As our goal was to use DAPC to describe the genetic structure among fish capture from the three sampling sites along orthogonal axes of genetic variation rather than to establish the extent of true population differences, we performed DAPC using population (sampling location) as the grouping factor rather than inferred genetic clusters, as is common. This analysis maximizes the differences among a priori assigned populations rather than among the putative genetic clusters contained in our dataset and therefore introduces the possibility of overfitting differences among a priori identified populations (sampling locations). Including more or less PCs in this DAPC did not change the relationship among populations from 5 to 90 PCs, but populations were more distinct with more PCs and therefore more genetic variation was incorporated into the DAPC. We retained the first 32 PCs.

We also examined the extent to which spatial and environmental (effluent) variables could be used to explain genetic variation among populations using redundancy analysis (RDA) and partial redundancy analysis (pRDA). RDA is a form of constrained ordination that can be used to describe linear relationships among components of multivariate response and multiple multivariate explanatory variables [57]. In pRDA, redundancy analysis is conducted on the residuals of the response variable after modeling the effect of one or more of the explanatory variables using RDA. In our RDA we used major principal components of genetic variation as the response variable, with the number of retained principal components determined by the Kaiser-Guttman criterion [108], and two explanatory variables: a spatial variable and the presence or absence of effluents. For the spatial variable we employed distance-based Moran’s eigenvector maps (dbMEMs). dbMEMs are capable of describing spatial variation at multiple scales, including spatial autocorrelation as well as local structures and are suitable as explanatory variables for constrained ordinations, such as RDA [109]. We retained all dbMEMs with positive values of Moran’s I as our spatial explanatory variables because we were interested in modeling only the effect of positive spatial autocorrelation on genetic variation using the spatial variable. Significance of the RDA is then tested using empirical p-values (permuting the response variables). We conducted the RDA using the R package vegan [110], using the rda() command and the anova.cca() command to test its significance. dbMEMs are calculated using a distance matrix of alongshore (5 m depth constrained) distances among sampling locations and the dbmem() command from the R package adespatial [111]. To calculate the proportion of variance explained by the constrained axes of the RDA, we used variance partitioning [112] implemented in the varpart() command of vegan in R. Because variation in PCA is equivalent to F_ST [113], we used our ordination results to calculate the proportion of among population variation explained by each of the models, using total constrained variation weighted by the portion of variation explained among the major principal components of genetic variation as the numerator and F_ST as the denominator [114]. To test the significance of each set explanatory variables in explaining genetic variation separately, we conducted pRDA, conditioning the genetic variation on each spatial and effluent variables, followed by a permutation analysis, again implemented in vegan using the rda() and anova() commands. We conducted three RDAs with pRDAs: a global RDA of both triads, and each triad separately.

Functional enrichment among candidate loci was assessed with DAVID v6.7 [115]. Candidate loci were annotated using the F. heteroclitus genome [46] where gene identifiers are based on expression and homology evidence. These gene identifiers were submitted to the Functional Annotation Clustering tool in DAVID v6.7 against a background of all F. heteroclitus gene annotations using the “high” classification stringency. DAVID’s Functional Annotation Clustering Tool groups similar functional annotations and collectively evaluates their enrichment. The statistical significance of annotation clusters is evaluated on the basis of DAVID’s EASE score: the mean value of the negative log-transformed, FDR corrected p-values of enrichment for all genes included in that cluster [115]. Clusters of enriched annotation terms with an EASE score greater than 1.3 are considered significant.

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol