Gene Space Evolution

Martin Malmstrøm
Ralf Britz
Michael Matschiner
Ole K Tørresen
Renny Kurnia Hadiaty
Norsham Yaakob
Heok Hui Tan
Kjetill Sigurd Jakobsen
Walter Salzburger
Lukas Rüber

To assess changes in gene, exon, and intron sizes in Paedocypris, we first identified the proteomic overlap of the two Paedocypris species, D. rerio, Di. nigroviridis, and T. rubripes by running OrthoFinder (Emms and Kelly 2015) on the complete protein sets of these five species. For D. rerio (GRCz10), Di. nigroviridis (TETRAODON8), and T. rubripes (FUGU4), we used the full protein sets from Ensembl v. 80 (Cunningham et al. 2015). Because some D. rerio genes have more than one protein or transcript in the Ensembl database, the output from BioMart (31,953 genes and 57,349 proteins) was filtered so that only the longest protein sequence from each gene was used in the analysis, and genes without protein sequences were removed. This resulted in a set of 25,460 genes, each with a single protein prediction. For the two Paedocypris species, the “standard” gene sets resulting from the annotation were used as input. These sets were filtered to include only genes with AED (Annotation Edit Distance) scores < 1 or those with a Pfam domain. By using only genes belonging to the 10,368 orthogroups found to contain orthologs from all five species (fig. 2a), we obtained a comprehensive but conservative data set as the basis for these analyses.

Information about each of the corresponding genes in D. rerio, Di. nigroviridis, and T. rubripes was obtained from BioMart and included the Ensembl gene and protein IDs, the chromosome name, and the start and stop positions of each gene, transcript, and exon. Intron sizes were then calculated from the exon positions using a custom script (“gene_stats_from_BioMart.rb”). In some cases, the sum of exon and intron lengths did not equal the total length of a gene, which appears to be caused by inconsistent registration of UTR regions for individual genes in the Ensembl database. In these cases, to be conservative with regard to intron length estimates, the gene length was shortened to the sum of the exons and the introns between them.
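The intron calculation described above can be illustrated with a minimal Python sketch. This is not the authors' Ruby script (“gene_stats_from_BioMart.rb”), whose contents are not shown here. The function name `gene_stats` and the example coordinates are hypothetical, and the sketch assumes 1-based inclusive coordinates and the non-overlapping exons of a single transcript.

```python
from typing import Dict, List, Tuple

# Assumption: 1-based, inclusive (start, end) coordinates for each exon.
Exon = Tuple[int, int]


def gene_stats(exons: List[Exon]) -> Dict[str, object]:
    """Derive exon, intron, and conservative gene lengths from exon positions.

    Assumes the exons belong to one transcript and do not overlap.
    """
    ordered = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in ordered]
    # An intron is the gap between the end of one exon and the start of the next.
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return {
        "exon_lengths": exon_lengths,
        "intron_lengths": intron_lengths,
        # Conservative gene length: exons plus the introns between them only,
        # so any annotated span outside the first and last exon is ignored.
        "gene_length": sum(exon_lengths) + sum(intron_lengths),
    }


# Hypothetical gene annotated as spanning 50..700 (651 bp), with three exons.
stats = gene_stats([(301, 400), (100, 200), (501, 650)])
print(stats)
```

With these made-up coordinates, the annotated span of 651 bp would be replaced by the 551 bp covered by the exons and the two introns between them, which mirrors the shortening rule described in the text.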

Intron and exon lengths for the two Paedocypris species were calculated in a similar manner, but on the basis of the “standard” filtered annotation file in “gff” format produced as part of the annotation pipeline. As for the other species, intron lengths were determined from the identified exons, here with a second custom script (“gene_stats_from_gff.rb”). Gene, exon, and intron length histograms were plotted with the R package ggplot2.
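As a rough illustration of the GFF-based variant, the Python sketch below collects exon features per transcript and derives exon and intron lengths from them. It is not the authors' “gene_stats_from_gff.rb”. It assumes GFF3-style input with nine tab-separated columns and a `Parent=` attribute on exon lines, and the function names and example records are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def exons_by_transcript(gff_lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Group exon coordinates by the transcript ID in the Parent attribute."""
    exons: Dict[str, List[Tuple[int, int]]] = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        # GFF3 allows an exon to list several comma-separated parents.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons[parent].append((int(cols[3]), int(cols[4])))
    return exons


def length_stats(exons: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Return (exon lengths, intron lengths) for one transcript's exons."""
    ordered = sorted(exons)
    exon_lengths = [end - start + 1 for start, end in ordered]
    intron_lengths = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]
    return exon_lengths, intron_lengths


# Hypothetical records; real input would be read from the annotation file.
example = [
    "##gff-version 3",
    "\t".join(["scaffold1", "annot", "gene", "100", "650", ".", "+", ".", "ID=g1"]),
    "\t".join(["scaffold1", "annot", "mRNA", "100", "650", ".", "+", ".", "ID=t1;Parent=g1"]),
    "\t".join(["scaffold1", "annot", "exon", "100", "200", ".", "+", ".", "ID=e1;Parent=t1"]),
    "\t".join(["scaffold1", "annot", "exon", "301", "650", ".", "+", ".", "ID=e2;Parent=t1"]),
]
grouped = exons_by_transcript(example)
exon_lens, intron_lens = length_stats(grouped["t1"])
print(exon_lens, intron_lens)
```

The per-transcript lengths produced this way could then be written to a table and passed to a plotting tool such as ggplot2 for the histograms.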
