Protein fasta files were retrieved from Ensembl v81: C. savignyi, P. marinus, L. chalumnae; Aniseed [23]: C. robusta, B. schlosseri, M. oculata, M. occidentalis, and B. leachii. Proteins from B. floridae were retrieved in JGI [62] and for O. dioica from Oikoarrays [63].
Functional annotation from all retrieved species were assessed using eggNOG-Mapper v.2 [57], based on the database eggNOG v.5 [56] applying DIAMOND as referred on [64].
Enrichment analysis was calculated with goatools [65] taking as background group the complete set of proteins reported for studied chordata species and the comparison group, the list of proteins for each species. The association file between proteins and GO was generated based on eggNOG-Mapper results, all the command line methods are described in Section S12. Final results of enrichment were plotted using ggplot2, tidyverse [66,67] and grid [68] R packages. TreeMap plots were performed with REVIGO [69]. Calculated p-values from goatools were used as input data to REVIGO webserver against UniProt-to-GO database (ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_gcrp.gaf.gz, accessed on 14 April 2020) and SimRel as semantic similarity measure.
Proteins with the same semantic terms in the REVIGO results were clustered and subject to a protein–protein interaction analysis using STRING (v.11) [70]. Proteins from D. vexillum were compared against the entire Chordata protein set. As C. robusta was the species with the largest number of recognizable homologs, this species was used as reference. Only connected nodes with “high” or “highest” confidence were analyzed and visualized.
A collection of reported homeobox proteins from human (of the family Homeoboxes (516) (https://www.genenames.org/cgi-bin/genegroup/download?id=516&type=branch, accessed on 10 October 2019) from HGNC database [71]), C. robusta, C. savignyi (both species from Ensembl v100 [72]), B. leachii [25], H. roretzi [73,74], and a variety of species from the HomeoDB [75] were retrieved from the corresponding references. This set was used to search along the annotated transcriptome and protein sequences from D. vexillum using tblastn, and blastp, respectively. The best candidates were obtained with an identity percent of ≥35, E-value and a query coverage of 70%. For command line details refer to Section S12.
As a complement, pairwise genome alignments with the new assembly from D. vexillum and close species that reported annotations of homeobox genes: B. floridae, B. leachii, B. schlosseri, C. savignyi, C. robusta, H. roretzi, and O. dioica, were performed with LASTZ [76]. References from homeobox genes were obtained from Aniseed [23] using the Gene Builder with the term hox, except from B. floridae where updated annotations (for v.2) were searched and retrieved from LanceletDB [77]. Cross-matching of shared regions and reported genes and homology searches were performed to support the identification of homeobox candidates.
We searched for RUNX, SOX, and Hh homologs in the output of eggNOG-Mapper for all studied chordate species. The corresponding orthology groups have the accession numbers: KOG3982, KOG0527, and KOG3638, respectively. Due to the lack of true RUNX orthologs on D. vexillum, we performed an additional analysis to confirm the presence of some homology signal. We retrieved the RUNX sequences reported on [78], from available 16 chordates from NCBI: AN08565.1, AAN08567.1, AAQ88389.1, AAS02047.1, AAS21356.1, BAA03485.1, BAF36001.1, BAF36011.1, EAX04278.1, EDL03777.1, EDL29993.1, ENSCINT00000004611.3, NP_001001890.1, NP_001092121.1, NP_004341.1 and NP_571678.1. Those sequences were searched with blastp in the proteome of D. vexillum and the following 10 species: B. floridae, B. leachii, B. schlosseri, C. robusta, C. savignyi, M. oculata, M. occidentalis, O. dioica, P. marinus, and L. chalumnae. On the other hand, the PFAM domain Runt (PF00853) was searched along all the reported proteomes of the described species using hmmscan (HMMER v.3.1b1) [79]. Filtering was based on the gathering score reported by PFAM and a low E-value <0.001.
Phylogenetic analysis from mentioned proteins was performed on the set of orthologs from the described target species and their corresponding orthologous sequences that have been obtained by the eggNOG-Mapper analysis. As an outgroup, we obtained from NCBI, the following sequences for RUNX: NP_999779.1 and XP_781626.2; for Hh: FBpp0121221, KDR14772, and XP_008546836.1.
For the analysis of RUNX, we included the reported sequences of lamprey (Lethenteron camtschaticum) [80], annotated in NCBI with the following accession numbers: AJM44878.1, AJM44883.1, and AJM44886.1. The complete phylogenetic analyses were performed by ETE 3 Toolkit [81], using Maximum Likelihood (ML) with the JTT+G+I substitution model and a bootstrapping of 100. Specific command line is described in Section S12. Gene IDs were replaced by “human-readable” names in Figure S5. A version with the database IDs is provided in Supplementary File 2.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.