Ancestral protein reconstruction

Robert D. Grinshpon
Suman Shrestha
James Titus-McQuillan
Paul T. Hamilton
Paul D. Swartz
A. Clay Clark

Two lists of taxa were used for ancestral protein reconstruction (APR). One list (APR_1, Supplementary Information, Table S1) was generated using a precursor to CaspBase [42] (caspbase.org), and the other (APR_2, Supplementary Information, Table S2) was generated by querying CaspBase. Each list has high sequence coverage spanning the majority of known proteins within the caspase family. APR_1 contains 253 caspase sequences, and APR_2 contains 258. APR_2 emphasizes mammalian over non-mammalian lineages, with 127 mammalian caspase sequences compared with 82 in APR_1. The two data sets overlap by 39.6%. In both lists, care was taken to mitigate the inclusion of erroneous sequences, to eliminate incomplete lineage sorting by including high coverage of representative taxa from all major vertebrate groups, and to incorporate full gene tree representation of each known caspase family member within each vertebrate group. The prodomain was pruned from the sequences because prodomains vary widely in sequence and length across the caspase family, and their inclusion introduces missing data and noise into downstream analyses. The multiple sequence alignment (MSA) was computed using PROMALS3D [43,44]. Alignments were checked in Geneious [45] to assess alignment accuracy, and ProtTest 3 [46] was used to select the model of protein evolution for phylogenetic analysis, using corrected Akaike Information Criterion (AICc) weights to identify the most probable model [47]. The phylogenetic tree was generated using IQ-TREE [48], which combines hill-climbing and stochastic perturbation methods for accuracy and time efficiency, and the tree was bootstrapped 1000 times as a test of phylogeny [49]. The tree was examined to remove erroneous sequences and mislabeled entries and to mitigate missing data, resulting in an alignment well suited to APR.
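To illustrate how AICc weights rank candidate models, the following is a minimal Python sketch, not ProtTest's own implementation. It uses the standard definitions AICc = 2k - 2 lnL + 2k(k+1)/(n-k-1) and w_i = exp(-Δ_i/2) / Σ_j exp(-Δ_j/2), where Δ_i is the difference from the lowest AICc. The function names, the candidate models listed, the log-likelihoods, the parameter counts, and the sample size are all placeholders chosen for illustration and are not values from this study.

```python
import math

def aicc(log_likelihood, k, n):
    """Corrected AIC for a model with k free parameters and sample size n."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)

def akaike_weights(scores):
    """Convert a {model: AICc} mapping into normalized Akaike weights."""
    best = min(scores.values())
    relative = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(relative.values())
    return {m: r / total for m, r in relative.items()}

# Placeholder inputs (NOT from the study): model -> (log-likelihood, free parameters)
candidates = {
    "LG+G": (-10000.0, 10),
    "WAG+G": (-10012.0, 10),
    "JTT+G": (-10030.0, 10),
}
n_sites = 300  # placeholder sample size, e.g. alignment length

scores = {m: aicc(lnl, k, n_sites) for m, (lnl, k) in candidates.items()}
weights = akaike_weights(scores)
best_model = max(weights, key=weights.get)  # model with the largest weight
```

The weights sum to one, so each can be read as the relative support for that model within the candidate set. In practice the number of free parameters usually includes the branch lengths as well as the substitution-model parameters.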
The APRs were constructed with FastML [50], using codon-based reconstruction models for accuracy since the models were generated from whole annotated genomes with complete metadata. We used an LG model of substitution [51], selected with ProtTest 3, and maximum likelihood (ML) for indel reconstruction. We provided our ML tree as a guide, optimized branch lengths given the highly divergent sequences, applied a gamma distribution, and computed a joint reconstruction to generate APRs at each node of interest. Sequences were codon-optimized for expression in E. coli and cloned into the pET11a vector with a C-terminal His6-tag (GenScript, U.S.A.). AncCP-6An was also designed similarly to the caspase-6 CT (constitutive two-chain) construct described previously [52]. All proteins were purified as described previously [53–55].
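As a rough illustration of what a joint reconstruction computes, the sketch below finds, for a single alignment column, the one assignment of states to all internal nodes that maximizes the likelihood, using an upward dynamic-programming pass followed by a downward traceback. It is not FastML's code. To stay short it assumes an equal-rates substitution model as a stand-in for LG, a uniform prior at the root, no gamma rate variation, and no indels. The `Node` class, the function names, and the toy tree are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
K = len(ALPHABET)

def p_trans(a, b, t):
    """Equal-rates transition probability over branch length t (stand-in for LG)."""
    decay = math.exp(-K * t / (K - 1))
    return 1 / K + (1 - 1 / K) * decay if a == b else (1 - decay) / K

class Node:
    def __init__(self, name, length=0.0, children=(), state=None):
        self.name = name              # unique label
        self.length = length          # branch length to the parent
        self.children = list(children)
        self.state = state            # observed residue (leaves only)

def joint_reconstruct(root):
    """Return {internal node name: state} for the most likely joint assignment."""
    choice = {}  # choice[node][parent_state] -> best state for that node

    def up(node):
        kids = [up(c) for c in node.children]
        options = [node.state] if node.state else ALPHABET
        best = {}
        choice[node.name] = {}
        for i in ALPHABET:  # i is the hypothetical state of the parent
            scored = [
                (p_trans(i, j, node.length) * math.prod(k[j] for k in kids), j)
                for j in options
            ]
            best[i], choice[node.name][i] = max(scored)
        return best

    kids = [up(c) for c in root.children]
    # Uniform root prior, so it does not affect which state wins.
    root_state = max(ALPHABET, key=lambda j: math.prod(k[j] for k in kids))
    states = {root.name: root_state}

    def down(node, parent_state):
        s = choice[node.name][parent_state]
        if node.children:
            states[node.name] = s
        for c in node.children:
            down(c, s)

    for c in root.children:
        down(c, root_state)
    return states

# Toy tree ((a:0.1, b:0.1)anc1:0.1, c:0.5)root with one observed residue per leaf.
anc1 = Node("anc1", 0.1, [Node("a", 0.1, state="L"), Node("b", 0.1, state="L")])
root = Node("root", children=[anc1, Node("c", 0.5, state="V")])
result = joint_reconstruct(root)
```

A joint reconstruction differs from a marginal one in that it picks the single best combination of states across all ancestors at once, rather than the most probable state at each node considered separately. A practical implementation would work in log space to avoid underflow on large trees and would repeat the calculation for every alignment column.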
