The UNEAK pipeline (Lu et al., 2013) was used to identify SNPs from RAD-seq data and to genotype individuals. Tags with fewer than five reads were excluded from tag pair identification. After filtering for SNPs with <50 % missing data, a minimum minor allele frequency of 0.01 and no apparent heterozygous genotypes in any of three doubled haploid M. sinensis accessions (Głowacka et al. 2012), 34 605 SNPs remained (Supplementary Data S2). In M. sinensis, we previously found improved resolution of population structure by using these filtering parameters as compared with more stringent parameters (Clark et al., 2014), which we confirmed in M. sacchariflorus (Supplementary Data Fig. S2).
Structure 2.3.4 (Falush et al. 2003), discriminant analysis of principal components (DAPC) implemented in the R package adegenet (Jombart et al., 2010) and TESS3 implemented in the tess3r R package (Caye et al., 2016) were used to assess population structure, hybridization and major groupings of M. sacchariflorus. Structure was run with 50 000 Markov chain Monte Carlo (MCMC) reps after a 10 000 rep burn-in using the admixture model with correlated allele frequencies. A preliminary Structure run at K = 1 was performed to infer lambda (the parameter for determining allele frequency priors), which was estimated at 0.4686. This value of lambda was then fixed, and six Structure runs each at K = 1–10 were performed. The Evanno method (Evanno et al., 2005) was used to explore possible values of K (Supplementary Data Fig. S3). Principal components analysis for DAPC was performed using the ‘glPca’ function of adegenet, with parameters set so that each marker would be centred and scaled, and all principal components would be retained. The ‘find.clusters’ function was then used to make initial groupings, with 1000 randomly chosen centroids to ensure convergence, and all principal components included. The Bayesian information criterion was used as a guide for selection of the number of clusters (Supplementary Data Fig. S4). DAPC was then performed with the ‘dapc’ function using the groupings from ‘find.clusters’, the first 250 principal components and all discriminant axes for K = 2–10. TESS3 was run in six replicates at K = 1–10, with the optimal replicate at each K being selected by the software. The cross-validation score at each value of K was plotted in order to help identify the optimal K value (Supplementary Data Fig. S5). The K value that was ultimately chosen for further analysis was the lowest K that resulted in consistency between groups identified by Structure, DAPC and TESS3, and was also biologically and geographically meaningful (Supplementary Data Fig. S6).
Allele frequencies within DAPC groups were estimated directly from the sampled genotypes for the diploid groups, and with the R package polyfreqs (Blischak et al., 2016) for the tetraploid and triploid groups. SNP diversity (D) was estimated as the expected heterozygosity (probability of drawing two different alleles from the population) at each SNP, averaged across all SNPs (Nei, 1973). Allelic richness was estimated using the extrapolation method of Foulley and Ollivier (2006). Pairwise Jost’s D (Jost, 2008) among DAPC groups was estimated from genotypes and allele frequencies using a custom R function (available at doi:10.5281/zenodo.58614). Pairwise Jost’s D values were then used as a distance matrix for the calculation of a Neighbor–Joining tree among populations using the R package ape (Paradis et al., 2004). The software TreeMix (Pickrell and Pritchard, 2012) was used to infer evolutionary relationships and gene flow among the groups identified by DAPC, including one group for M. sinensis. The number of migration edges selected was the highest number at which each edge connected a unique pair of tree edges.
A second Neighbor–Joining tree was calculated to show relationships among individuals. To reduce the confounding effect of hybridization on the calculation of a Neighbor–Joining tree between Miscanthus individuals, individuals in M. × giganteus DAPC groups were excluded, as were SNPs that were highly differentiated between M. sinensis and M. sacchariflorus according to Structure results, leaving 731 individuals and 31 743 SNPs, respectively. Manhattan distances were calculated in R between each pair of individual genotypes; the distance between two genotypes homozygous for different alleles was 2, the distance between a homozygous and heterozygous genotype was 1 and the distance between identical genotypes was 0; these distances were then summed across all SNPs and scaled by the proportion of non-missing data (Black, 2006). Manhattan distances were used for calculating the Neighbor–Joining tree using the R package ape (Paradis et al., 2004). This same genetic distance matrix was used for conducting Mantel tests using the R package ade4 (Chessel et al., 2004) utilizing geographic distances calculated with the R package geosphere (Hijmans, 2017).
To test hypotheses about the history of population divergence and admixture, scenarios were tested using DIYABC 2.1.0 (Cornuet et al., 2014). The origins of tetraploid populations and the relationships among diploid populations were examined separately. In both cases, six scenarios with uniform prior probabilities were tested using a total of 600 000 simulations. Posterior probabilities of scenarios were estimated using the logistic regression approach.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.