Locus-specific phylogeny from loci not containing protein-coding genes. To infer baboon phylogeny from a series of independent, putatively neutral loci (gene trees), we identified and excluded bases with any of the following characteristics: genic regions, based on refGene annotations accessed via the refGene table of UCSC’s Genome Browser; bases with coverage <7× for any individual in the diversity panel; bases with any missing genotype; bases with read depth above the 95th percentile; CpG sites; bases within 3 bp of an indel; repetitive DNA as designated by RepeatMasker (version open-3-3-0, sensitive setting, RepBase library release 20110920) and Tandem Repeats Finder (period of 12 or less); and bases within 100 bp of a phastCons element. For the bases that passed filtration, we used BEDtools to concatenate nearby loci separated by gaps of less than 1 kb. We then retained only loci of 1 to 100 kb for further analyses, to maximize information content while still reducing the chance of unappreciated recombination.
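
The concatenation and size-filtering steps can be illustrated with a minimal plain-Python sketch. This is not the BEDtools command used in the study; the function names, the half-open (start, end) coordinate convention, and the toy intervals are assumptions made for illustration only.

```python
def merge_nearby(intervals, max_gap=1000):
    """Merge (start, end) intervals on one chromosome separated by gaps < max_gap.

    Coordinates are assumed to be half-open, so the gap between two
    intervals is next_start - previous_end.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < max_gap:
            # Gap is below the threshold: extend the previous locus.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def size_filter(intervals, min_len=1000, max_len=100_000):
    """Keep only loci whose length lies between min_len and max_len (inclusive)."""
    return [(s, e) for s, e in intervals if min_len <= e - s <= max_len]


# Hypothetical runs of bases that passed filtration on one chromosome.
passing = [(0, 600), (900, 1500), (5000, 5200), (10000, 12000)]
loci = size_filter(merge_nearby(passing))
print(loci)
```

In this toy input, the first two runs are 300 bp apart and are joined into one 1.5-kb locus, while the isolated 200-bp run is discarded by the size filter.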

For each locus, we extracted the sequence for all baboon diversity panel individuals, replacing heterozygous sites with their corresponding IUPAC (International Union of Pure and Applied Chemistry) codes. To obtain an outgroup sequence for each locus, we compared the baboon reference genome (Panu_2.0) to the rhesus macaque reference genome sequence (rheMac2) using megablast, retaining hits with E values less than 1 × 10⁻¹⁰⁰ and at least 95% identity. Only loci with a single rhesus match were retained. All analyses used custom scripts and BEDtools, SAMtools, and VCFtools. We aligned the sequences at each locus with Muscle using default parameters.
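
As an illustration of the heterozygote encoding, the sketch below maps a diploid genotype to a single character using the standard IUPAC two-base ambiguity codes. The function name and the assumption of biallelic single-nucleotide genotypes are ours; the custom scripts used in the study are not described in the text.

```python
# Standard IUPAC ambiguity codes for two-base combinations.
IUPAC = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def genotype_to_base(allele1, allele2):
    """Return one character for a diploid SNP genotype.

    Homozygous genotypes return the base itself; heterozygous
    genotypes return the matching IUPAC ambiguity code.
    """
    a, b = allele1.upper(), allele2.upper()
    if a == b:
        return a
    return IUPAC[frozenset((a, b))]


print(genotype_to_base("A", "G"))  # heterozygous purine pair
print(genotype_to_base("C", "C"))  # homozygous site is left unchanged
```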

We inferred a gene tree for each locus using MrBayes 3.2.1 (50), setting the outgroup to the rhesus sequence. For each alignment, we used a GTR+G model of molecular evolution, and we ran MrBayes twice for 1,000,000 generations, sampling every 100 generations. We assessed convergence by checking statistics in the MrBayes log (LnL, PSRF, and average SD of split frequencies <0.01) and by using Tracer v.1.5 to confirm that the effective sample size was >200 and to compare the performance of the independent analyses. After checking for convergence, we summarized the posterior distribution of trees after removing the first 25% of generations as burn-in.
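
The arithmetic behind the sampling scheme and burn-in can be sketched as follows. This is a rough illustration rather than MrBayes' own bookkeeping: the function names are hypothetical, and the sketch assumes that the starting state (generation 0) is recorded along with each sampled generation.

```python
def n_samples(ngen, samplefreq):
    """Number of recorded samples per run, assuming generation 0 is also stored."""
    return ngen // samplefreq + 1


def discard_burnin(samples, burnin_frac=0.25):
    """Drop the first burnin_frac of the samples and return the rest."""
    n_burn = int(len(samples) * burnin_frac)
    return samples[n_burn:]


total = n_samples(1_000_000, 100)
kept = discard_burnin(list(range(total)))
print(total, len(kept))
```

Under these assumptions, each run records 10,001 trees, of which 7,501 remain after the first 25% are discarded.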

Filtration for phylogenetic information content and Bayesian concordance analysis (BCA). We used the program mbsum from BUCKy v1.4.2 to summarize the posterior tree output from MrBayes for each locus (31, 55). As is expected with a recent radiation, many loci had limited phylogenetic information content, decreasing the signal-to-noise ratio and increasing concordance analysis runtime. To quickly remove loci with limited signal, we filtered loci based on the frequency of the most highly supported tree in the MrBayes tally of tree topologies. A locus with no information is expected to yield a flat distribution of random topologies, each appearing once. In contrast, a locus with strong signal will yield the same topology multiple times. We removed loci if the most frequently supported topology occurred in fewer than 10% of sampled trees. We then ran the main BUCKy program with default parameters to infer concordance factor (CF) summary statistics, which describe the proportion of trees that contain a particular clade.
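
The signal filter described above can be sketched in a few lines. This is a minimal illustration, not the script used in the study: it assumes that each sampled tree has already been reduced to a canonical topology string (so identical topologies compare equal), and the function names and toy Newick strings are hypothetical.

```python
from collections import Counter


def top_topology_fraction(topologies):
    """Fraction of sampled trees that share the most frequent topology."""
    counts = Counter(topologies)
    return counts.most_common(1)[0][1] / len(topologies)


def keep_locus(topologies, min_frac=0.10):
    """Keep a locus unless its best topology occurs in fewer than min_frac of samples."""
    return top_topology_fraction(topologies) >= min_frac


# A locus with signal: one topology dominates the posterior sample.
informative = ["((A,B),C);"] * 6 + ["((A,C),B);"] * 4
# A locus without signal: every sampled topology appears exactly once.
flat = [f"topology_{i}" for i in range(20)]

print(keep_locus(informative), keep_locus(flat))
```

Because the check only needs the tally of topologies, it is cheap to run on every locus before the slower concordance analysis.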

Genic tree analysis of sequences containing annotated protein-coding genes. We also analyzed 3,267 chromosomal segments that each contain one annotated protein-coding gene. Segments were selected based on refGene tables within the UCSC Genome Browser. Local tree topologies were inferred as described above for the loci that do not contain protein-coding genes. We began with 3,267 genic segments, but after filtering for phylogenetic signal, length, and other criteria, we obtained final results for 2,201. For each of these genic segments, we computed pairwise Euclidean distances using the Kendall and Colijn metric (56) and then performed principal component analysis (PCA) on this distance matrix to group trees into six clusters based on tree similarity. We chose to further investigate the first three clusters. We performed a GO overrepresentation test, using as the reference list the genes for all trees that survived filtering. The results for GO biological processes and molecular functions are reported in table S10.
