We used 17 data sets from 11 published studies of eukaryotes and 2 published studies of prokaryotes that cover the major groups in the tree of life (table 1). These data sets were selected for relative completeness (missing data <50%) and large sample size (>80 sequences). A large amount of missing data (>50%) can result in unreliable estimates of branch lengths and other phylogenetic errors (Wiens and Moen 2008; Lemmon et al. 2009; Filipski et al. 2014; Xi et al. 2016; Marin and Hedges 2018) and may therefore bias CorrTest results. When a phylogeny with branch lengths was available from the original study, we estimated relative rates directly from the branch lengths via RRF (Tamura et al. 2018) and computed the selected features (ρs, ρad, d1, and d2) to conduct CorrTest. Otherwise, maximum likelihood estimates of branch lengths were obtained in the command-line version of MEGA 7 (Kumar et al. 2012; Kumar et al. 2016) using the published topology, the sequence alignments, and the substitution model specified in the original article. To examine the impact of specifying a time-reversible substitution model on CorrTest, we also estimated branch lengths under an unrestricted substitution model (Yang 1994) for all the nucleotide data sets in PAML (Yang 2007) and conducted CorrTest on the resulting branch lengths.
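The two selection criteria stated above (missing data <50% and >80 sequences) can be expressed as a short check. The following is a minimal Python sketch, not the authors' code: the function names, the representation of an alignment as a dict of name to sequence string, and the choice of `-` and `?` as the missing-data characters are all assumptions made for illustration.

```python
def missing_fraction(alignment, missing_chars="-?"):
    """Fraction of alignment cells that are gaps or unknown characters.

    `alignment` is assumed to map sequence names to aligned strings.
    `missing_chars` is an assumption; ambiguity codes such as N or X
    could be added for nucleotide or protein data, respectively.
    """
    missing = set(missing_chars)
    total = sum(len(seq) for seq in alignment.values())
    if total == 0:
        return 1.0  # treat an empty alignment as entirely missing
    n_missing = sum(ch in missing for seq in alignment.values() for ch in seq)
    return n_missing / total


def passes_selection(alignment, max_missing=0.5, min_sequences=80):
    """Apply the two thresholds from the text: <50% missing, >80 sequences."""
    enough_sequences = len(alignment) > min_sequences
    complete_enough = missing_fraction(alignment) < max_missing
    return enough_sequences and complete_enough
```

A data set would be retained only if `passes_selection(alignment)` returns `True`. How missing data were actually counted in the original study is not specified here, so the per-cell definition above is only one plausible reading.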
To obtain the autocorrelation parameter (v), we used MCMCTree (Yang 2007) with the same input priors as the original study, but omitted calibration priors to avoid the influence of calibration uncertainty densities on the estimate of v. We did, however, provide a root calibration because MCMCTree required it. For this purpose, we specified the root calibration as the one used in the original article or as the median age of the root node in the TimeTree database (Hedges et al. 2006; Kumar et al. 2017) ±50 My (uniform distribution with 2.5% relaxation on minimum and maximum bounds). Because Bayesian analyses required long computational times, we used the original alignments in MCMCTree to infer v only if they were shorter than 20,000 sites. If the alignments were longer than 20,000 sites, we randomly selected 20,000 sites from the original alignments. However, one data set (Ruhfel et al. 2014) contained more than 300 ingroup species, such that even alignments of 20,000 sites required prohibitive amounts of memory. In this case, we randomly selected 2,000 sites from the original alignments to use in MCMCTree for v inference (similar results were obtained with a different site subset). Two independent runs were conducted for each data set, and results were checked in Tracer (Rambaut et al. 2018) for convergence. Effective sample size (ESS) values were higher than 200 after removing 10% burn-in samples for each run. All empirical data sets are available at https://github.com/cathyqqtao/CorrTest (last accessed February 6, 2019).
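The site-subsampling step described above can be sketched as follows. This is an illustrative Python example under stated assumptions rather than the procedure actually used: the function name `subsample_sites`, the dict-of-strings alignment representation, sampling without replacement, and keeping the sampled columns in their original order are all choices made here, not details given in the text.

```python
import random


def subsample_sites(alignment, n_sites=20000, seed=0):
    """Randomly keep `n_sites` alignment columns, identical across sequences.

    Alignments with at most `n_sites` columns are returned unchanged,
    mirroring the rule of using the original alignment when it is short.
    Sampling is without replacement (an assumption).
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    if length <= n_sites:
        return dict(alignment)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    columns = sorted(rng.sample(range(length), n_sites))
    return {
        name: "".join(seq[i] for i in columns)
        for name, seq in alignment.items()
    }
```

For the large data set mentioned above, the same function would be called with `n_sites=2000`. Rerunning with a different `seed` gives a different site subset, which is one way to check that the estimate of v is not sensitive to the particular sites drawn.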