For each transcriptome of the 1KP [11], we used the DupPipe pipeline to construct gene families and estimate the age of gene duplications [9]. We identified duplicate pairs as sequences that demonstrate 40% sequence similarity over ≥300 base pairs from a discontiguous MegaBLAST search [12, 13]. We translated DNA sequences and identified reading frames by comparing the Genewise (Genewise, RRID:SCR_015054) [14] alignment to the best-hit protein from a collection of proteins from 25 plant genomes from Phytozome (Phytozome, RRID:SCR_006507) [15]. For each analysis, we used protein-guided DNA alignments to align our nucleic acid sequences while maintaining reading frame. Best-hit proteins were paired with each gene at a minimum cut-off of 30% sequence similarity over ≥150 sites. Gene families were then constructed by single-linkage clustering. We then estimated synonymous divergence (Ks) for each node in the gene family phylogenies using PAML (PAML, RRID:SCR_014932) [16] with the F3×4 model. A recent study has shown that estimating node Ks values for duplicates from gene family trees, rather than from pairwise comparisons of paralogs, can reduce error in the estimated Ks values of duplication events and has a significant effect on the resolution of WGD peaks [17]. In this project, we used the approach described in Tiley et al. 2018 [17]. Previous analyses also indicate that there is reasonable power to infer WGDs in Ks plots when paralog divergences are Ks < 2. Saturation and other errors accumulate at paralog divergences of Ks > 2; they can create false signals of WGDs and make it difficult to distinguish true WGDs from the background [17, 18]. We followed the recommendations of these studies in all of our 1KP Ks plot inferences.
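The duplicate-pair filter and single-linkage clustering steps described above can be illustrated with a minimal sketch. This is not the DupPipe code: the hit tuples, the function names `filter_hits` and `single_linkage`, and the example identifiers are hypothetical, and treating the 40% and 300 bp values as inclusive minimum thresholds is an assumption.

```python
from collections import defaultdict

def filter_hits(hits, min_identity=40.0, min_length=300):
    """Keep hits that meet both cut-offs (assumed to be inclusive minimums).

    Each hit is a hypothetical tuple: (query, subject, percent_identity, length).
    """
    return [(a, b) for a, b, identity, length in hits
            if a != b and identity >= min_identity and length >= min_length]

def single_linkage(pairs):
    """Group sequences into families: any chain of retained hits joins a family."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = defaultdict(set)
    for seq in list(parent):
        families[find(seq)].add(seq)
    return sorted((sorted(members) for members in families.values()),
                  key=lambda members: members[0])

# Made-up hits for illustration only.
hits = [
    ("g1", "g2", 85.0, 450),
    ("g2", "g3", 62.0, 320),
    ("g4", "g5", 91.0, 900),
    ("g3", "g4", 38.0, 500),  # rejected: identity below 40%
    ("g5", "g6", 75.0, 120),  # rejected: alignment shorter than 300 bp
]
families = single_linkage(filter_hits(hits))
print(families)
```

Because clustering is single-linkage, g1 and g3 land in the same family through g2 even though no direct hit between them is listed.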
Although we plotted and presented 2 sets of histograms with x-axis ranges of Ks 0–2 and Ks 0–5 to assess WGDs at different resolutions (Figs 1, 2), we did not identify peaks with Ks > 2 as potential WGDs unless other data were available (e.g., synteny or phylogenomic evidence). Note that this means the rate of substitution in a lineage limits the depth of time at which we can reliably infer the presence or absence of putative WGDs. Here, we provide the 1,153 raw output files from the DupPipe pipeline and the 2,306 Ks plots generated in these analyses. Each raw output file is a tab-delimited text file containing the node Ks value for each duplication. Gene annotation from the Arabidopsis thaliana gene ontology is also provided. All files are available in Bitbucket and GigaDB [19].
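As a rough illustration of how such a tab-delimited file could be binned into the two histogram ranges, the sketch below reads node Ks values and counts them over Ks 0–2 and Ks 0–5. The column names (`family`, `node_ks`), the bin width, and the sample rows are assumptions made for the example; the actual DupPipe output layout should be checked against the deposited files.

```python
import csv
import io

def read_node_ks(handle, ks_column="node_ks"):
    """Read node Ks values from a tab-delimited file with a header row.

    The column name is hypothetical; adjust it to the real file layout.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [float(row[ks_column]) for row in reader]

def ks_histogram(values, max_ks, bin_width=0.5):
    """Count Ks values in equal-width bins covering [0, max_ks)."""
    n_bins = round(max_ks / bin_width)
    counts = [0] * n_bins
    for ks in values:
        if 0 <= ks < max_ks:
            counts[min(int(ks / bin_width), n_bins - 1)] += 1
    return counts

# Made-up rows standing in for one raw output file.
sample = io.StringIO(
    "family\tnode_ks\n"
    "fam1\t0.10\n"
    "fam1\t0.40\n"
    "fam2\t0.38\n"
    "fam3\t1.20\n"
    "fam4\t2.60\n"
    "fam5\t4.90\n"
    "fam6\t6.00\n"
)
node_ks = read_node_ks(sample)
hist_0_2 = ks_histogram(node_ks, max_ks=2)
hist_0_5 = ks_histogram(node_ks, max_ks=5)
print(hist_0_2)
print(hist_0_5)
```

A real plot would use a much finer bin width; the coarse bins here only keep the example easy to check by hand.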
Figure 1: Histograms of the age distribution of gene duplications (Ks plots) with mixture models of inferred WGDs for (a) Pandorina morum (green alga), no inferred WGD peak; (b) Sphagnum recurvatum (moss), inferred WGD peak median Ks = 0.38; (c) Diphasiastrum digitatum (lycophyte), inferred WGD peak median Ks = 0.42, 1.62; (d) Ceratopteris thalictroides (fern), inferred WGD peak median Ks = 1.08; (e) Pseudotsuga wilsoniana (gymnosperm), inferred WGD peak median Ks = 0.38, 1.18; (f) Ipomoea nil (angiosperm), inferred WGD peak median Ks = 0.66. Histogram x-axis scale is Ks 0–2. The mixture model distributions consistent with inferred ancient WGDs are highlighted in yellow.
Figure 2: Histograms of the age distribution of gene duplications (Ks plots) with mixture models of inferred WGDs for (a) Pandorina morum (green alga), no inferred WGD peak; (b) Sphagnum recurvatum (moss), inferred WGD peak median Ks = 0.38; (c) Diphasiastrum digitatum (lycophyte), inferred WGD peak median Ks = 0.42, 1.62; (d) Ceratopteris thalictroides (fern), inferred WGD peak median Ks = 1.08, 3.07; (e) Pseudotsuga wilsoniana (gymnosperm), inferred WGD peak median Ks = 0.38, 1.18; (f) Ipomoea nil (angiosperm), inferred WGD peak median Ks = 0.66, 2.15. Histogram x-axis scale is Ks 0–5. The mixture model distributions consistent with inferred ancient WGDs are highlighted in green.
To identify significant features in the gene age distributions that may correspond to WGDs, we used 2 statistical tests: Kolmogorov-Smirnov (K-S) goodness-of-fit tests and mixture models. We first identified taxa with potential WGDs by comparing their paralog ages to a simulated null distribution without ancient WGDs using a K-S goodness-of-fit test [20]. For taxa with evidence for a significant peak relative to the null, we then used a mixture model implemented in the mixtools R package [21] to identify significant peaks of gene duplication consistent with WGDs and estimate their median Ks values (Figs 1 and 2). These approaches have been used to infer WGDs in Ks plots in many species that were subsequently corroborated by syntenic analyses of whole-genome sequences [20, 22–24]. A recent trend in the community is to infer a WGD simply by visually surveying the Ks plot of a single species, without a model or statistical inference (e.g., [25–28]). Because we used these 2 statistical tests, our results have been more rigorously evaluated than those of many recent studies of WGDs.
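The first of these tests can be sketched as follows. This is an illustrative two-sample form of the K-S statistic, comparing observed paralog ages with values drawn from a null simulation; whether the authors used a two-sample or a one-sample formulation is not stated here, so that choice is an assumption. The function names, the example values, and the use of the standard large-sample critical value are likewise illustrative.

```python
import math
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def ks_critical(n, m, alpha=0.001):
    """Large-sample critical value for the two-sample K-S statistic."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))

# Made-up Ks values: observed ages with an excess near Ks = 0.4,
# and a null sample that only decays with age.
observed = [0.05, 0.10, 0.35, 0.38, 0.40, 0.41, 0.42, 0.45, 0.90, 1.50]
null_sim = [0.02, 0.04, 0.06, 0.09, 0.12, 0.18, 0.25, 0.40, 0.80, 1.60]

d = ks_statistic(observed, null_sim)
threshold = ks_critical(len(observed), len(null_sim), alpha=0.05)
print(d, threshold, d > threshold)
```

With samples this small the asymptotic threshold is only a rough guide; real transcriptome data sets contain thousands of duplications, where the approximation is more appropriate.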
To visually demonstrate our gene age distribution approach, we provide example Ks plots for 4 major lineages across the green plant phylogeny. In the green alga Pandorina morum, the K-S test indicated that the paralog age distribution was significantly different from a simulated null. However, we did not observe any peaks of duplication consistent with the expected signature of an ancient WGD in the 2 sets of histograms (Figs 1a, 2a). In the land plant examples, the K-S test also found that paralog age distributions were significantly different from null simulations (P < 0.001). In the bryophyte example, we observed a single peak of duplication consistent with an ancient WGD in the Ks plots (Sphagnum recurvatum, median Ks = 0.38, Figs 1b, 2b). In the lycophyte, fern, gymnosperm, and angiosperm examples, we observed 2 peaks of duplication consistent with 2 rounds of putative ancient WGD in each species. The mixtools mixture models estimated that these putative WGD peaks have median Ks values of 0.42 and 1.62 in Diphasiastrum digitatum (Figs 1c, 2c), 1.08 and 3.07 in Ceratopteris thalictroides (Figs 1d, 2d), 0.38 and 1.18 in Pseudotsuga wilsoniana (Figs 1e, 2e), and 0.66 and 2.15 in Ipomoea nil (Figs 1f, 2f).
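To show how a mixture model can recover two peak locations of this kind, here is a minimal expectation-maximization sketch for a two-component Gaussian mixture. It is not the mixtools implementation and not the authors' analysis: the choice of Gaussian components on untransformed Ks values, the quantile-based starting values, the fixed iteration count, and the synthetic data (centered near 0.4 and 1.6 purely for illustration) are all assumptions.

```python
import math
import random

def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

def fit_two_component_mixture(values, n_iter=200):
    """Fit a 2-component Gaussian mixture by EM.

    Returns (mean, sd, weight) tuples sorted by mean.
    """
    data = sorted(values)
    n = len(data)
    # Start the two means at the lower and upper quartiles.
    means = [data[n // 4], data[3 * n // 4]]
    overall_mean = sum(data) / n
    overall_sd = math.sqrt(sum((x - overall_mean) ** 2 for x in data) / n)
    sds = [overall_sd, overall_sd]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(p) or 1e-300
            resp.append([pk / total for pk in p])
        # M-step: update each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            sds[k] = max(math.sqrt(var), 1e-3)
            weights[k] = nk / n
    return sorted(zip(means, sds, weights))

# Synthetic Ks values with two peaks; not real 1KP data.
rng = random.Random(42)
ks_values = ([rng.gauss(0.4, 0.08) for _ in range(300)]
             + [rng.gauss(1.6, 0.15) for _ in range(300)])

components = fit_two_component_mixture(ks_values)
for mean, sd, weight in components:
    # For a Gaussian component, the median equals the mean.
    print(f"peak median Ks ~ {mean:.2f}, sd ~ {sd:.2f}, weight ~ {weight:.2f}")
```

Real Ks distributions also contain a background of small-scale duplications, so a practical fit would typically need more components and a way to choose their number.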