Whole-genome alignment and divergence estimation

Rory J Craig, Ahmed R Hasan, Rob W Ness, Peter D Keightley

An eight-species core-Reinhardtinia whole-genome alignment (WGA) was produced using Cactus (Armstrong et al., 2019) with all available high-quality genomes (C. reinhardtii v5, C. incerta, C. schloesseri, E. debaryana, G. pectorale, Y. unicocca, Eudorina sp., and V. carteri v2). The required guide phylogeny was estimated from alignments of fourfold degenerate (4D) sites extracted from single-copy orthologs identified by BUSCO (genome mode, Chlorophyta odb10 dataset). Protein sequences derived from 1,543 BUSCO genes present in all eight species were aligned with MAFFT and subsequently back-translated to nucleotide sequences. Sites where the aligned codon in all eight species contained a 4D site were then extracted (250,361 sites), and the guide phylogeny was produced by supplying the 4D site alignment and the topology (extracted from the Volvocales species tree, see above) to phyloFit (PHAST v1.4; Siepel et al., 2005), which was run with default parameters (i.e. the GTR substitution model).
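The back-translation step above can be illustrated with a minimal Python sketch. The protocol does not state which tool performed this step, so the function name `back_translate` and its input conventions (a gapped protein alignment row plus the matching unaligned coding sequence) are assumptions made purely for illustration.

```python
def back_translate(aligned_protein: str, cds: str) -> str:
    """Thread an unaligned CDS onto one gapped row of a protein alignment.

    Hypothetical helper: each residue consumes one codon from the CDS and
    each alignment gap ('-') becomes a three-base gap ('---').
    """
    # Split the CDS into complete codons, ignoring any trailing partial codon.
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]

    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    # Accept one extra codon, assumed to be a stop codon absent from the protein.
    if len(codons) not in (n_residues, n_residues + 1):
        raise ValueError("CDS length does not match the protein sequence")

    next_codon = iter(codons)
    out = []
    for aa in aligned_protein:
        out.append("---" if aa == "-" else next(next_codon))
    return "".join(out)


if __name__ == "__main__":
    print(back_translate("M-K", "ATGAAA"))
```

Applying the same function to every species in a protein alignment gives a codon alignment in which each group of three columns corresponds to one aligned amino acid, which is the form needed to test codons for 4D sites.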

Where available, the R domain of the mating-type (MT) locus that was not included in a given assembly was appended as an additional contig (extracted from the following NCBI accessions: C. reinhardtii MT GU814015.1, G. pectorale MT+ LC062719.1, Y. unicocca MT LC314413.1, Eudorina sp. MT male LC314415.1, V. carteri MT male GU784916.1). All genomes were soft-masked for repeats as described above, and Cactus was run using the guide phylogeny, with all genomes set as reference quality. For post-processing, a multiple alignment format (MAF) alignment with C. reinhardtii as the reference genome was extracted from the resulting hierarchical alignment (HAL) file using the HAL tools command hal2maf (v2.1; Hickey et al., 2013) with the options --onlyOrthologs and --noAncestors. Paralogous alignments were reduced to one sequence per species by retaining the sequence with the highest similarity to the consensus of the alignment block, using mafDuplicateFilter (mafTools suite v0.1; Earl et al., 2014).
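The paralog-reduction rule described above (keep, for each species, the row most similar to the block consensus) can be sketched in Python as follows. This is not the mafDuplicateFilter implementation: the function names, the majority-rule consensus, and the simple fraction-of-matches similarity score are all assumptions chosen to illustrate the idea.

```python
from collections import Counter


def consensus(rows):
    """Majority base per column; gaps are ignored unless the column is all gaps."""
    cons = []
    for column in zip(*rows):
        bases = [b.upper() for b in column if b != "-"]
        cons.append(Counter(bases).most_common(1)[0][0] if bases else "-")
    return "".join(cons)


def similarity(row, cons):
    """Fraction of columns in which the row matches the consensus."""
    matches = sum(1 for a, b in zip(row.upper(), cons) if a == b)
    return matches / len(cons) if cons else 0.0


def filter_duplicates(block):
    """Reduce an alignment block to one row per species.

    block: list of (species, aligned_sequence) tuples of equal length.
    Soft-masked (lowercase) bases are compared case-insensitively.
    """
    cons = consensus([seq for _, seq in block])
    best = {}
    for species, seq in block:
        score = similarity(seq, cons)
        # Keep the first row seen for a species unless a later one scores higher.
        if species not in best or score > best[species][0]:
            best[species] = (score, seq)
    return [(species, seq) for species, (_, seq) in best.items()]


if __name__ == "__main__":
    example = [
        ("creinhardtii", "ACGTACGT"),
        ("vcarteri", "ACGTACGT"),
        ("vcarteri", "TTGTACGT"),
        ("gpectorale", "ACGTACGA"),
    ]
    for species, seq in filter_duplicates(example):
        print(species, seq)
```

In this toy block the second V. carteri row differs from the consensus at two columns, so only the first is retained.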

Final estimates of putatively neutral divergence were obtained using a method adopted from Green et al. (2014). For each C. reinhardtii protein-coding gene, the alignment of each exon was extracted and the exon alignments were concatenated. In the resulting CDS alignments, a site was considered to be 4D if the codon in C. reinhardtii included a 4D site and all seven other species had a triplet of aligned bases that also included a 4D site at the same position (i.e. the aligned triplet was assumed to be a valid codon, based on its alignment to a C. reinhardtii codon). The resulting alignment of 1,552,562 sites was then passed to phyloFit with the species tree, as described above.
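The 4D site criterion described above can be sketched in Python. The sketch assumes the standard genetic code, and assumes that each concatenated CDS alignment has already been reduced to columns where the reference is ungapped and in frame. The names `FOURFOLD_PREFIXES`, `is_4d_codon`, and `extract_4d_sites` are hypothetical and not taken from the authors' pipeline.

```python
# First two codon positions of the eight fourfold degenerate codon families
# in the standard genetic code (Leu, Val, Ser, Pro, Thr, Ala, Arg, Gly).
FOURFOLD_PREFIXES = {"CT", "GT", "TC", "CC", "AC", "GC", "CG", "GG"}


def is_4d_codon(triplet):
    """True if the triplet is a gap-free codon whose third position is 4D."""
    t = triplet.upper()
    return len(t) == 3 and t[:2] in FOURFOLD_PREFIXES and t[2] in "ACGT"


def extract_4d_sites(cds_alignment, reference):
    """Collect third-position bases at codons that are 4D in every species.

    cds_alignment: dict mapping species name to an aligned CDS string.
    reference: key of the species that defines the reading frame
               (assumed ungapped, starting at codon position 1).
    """
    ref = cds_alignment[reference]
    sites = {species: [] for species in cds_alignment}
    usable = len(ref) - len(ref) % 3
    for i in range(0, usable, 3):
        triplets = {sp: seq[i:i + 3] for sp, seq in cds_alignment.items()}
        # Require a 4D codon in the reference and in every aligned triplet.
        if all(is_4d_codon(t) for t in triplets.values()):
            for sp, t in triplets.items():
                sites[sp].append(t[2].upper())
    return {sp: "".join(bases) for sp, bases in sites.items()}


if __name__ == "__main__":
    toy = {
        "ref": "ATGGCTCTAAAA",
        "sp2": "ATGGCCTTAAAA",
        "sp3": "ATGGCGCT-AAA",
    }
    print(extract_4d_sites(toy, "ref"))
```

In the toy alignment only the second codon (GCT, GCC, GCG) passes: the third codon is rejected because TTA is not a fourfold degenerate codon, even though the reference codon CTA is. Because the third position is the 4D site in every fourfold degenerate family of the standard code, checking the first two bases is enough to ensure the 4D site falls at the same position in all species.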
