Identification of shared circRNA loci between species

Franziska Gruhl; Peggy Janich; Henrik Kaessmann; David Gatfield


This method is extracted from research article: eLife, Sep 2021

Circular RNA repertoires are associated with evolutionarily young transposable elements

DOI: 10.7554/eLife.67991


Shared circRNA loci were defined on three different levels depending on whether the ‘parental gene’, the ‘circRNA locus’ in the gene or the ‘start/stop exons’ overlapped between species (see Figure 2A and Figure 2—figure supplement 1A). Overall considerations of this kind have recently also been outlined in Patop et al., 2019.

Level 1 - Parental genes: One-to-one (1:1) therian orthologous genes were defined between opossum, mouse, rat, rhesus macaque and human using the Ensembl orthology annotation (confidence intervals 0 and 1, restricted to clear one-to-one orthologs). The same procedure was performed to retrieve the 1:1 orthologous genes for the eutherians (mouse, rat, rhesus macaque, human), for rodents (mouse, rat), and primates (rhesus macaque, human). Shared circRNA loci between species were assessed by counting the number of 1:1 orthologous parental genes between the five species. The analysis was restricted to protein-coding genes.
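The Level 1 count can be illustrated with a minimal sketch. It assumes the 1:1 ortholog table has already been exported from Ensembl as one record per orthologous group; the function name `count_shared_parental_genes`, the data layout and the gene identifiers are hypothetical and not taken from the original analysis.

```python
def count_shared_parental_genes(ortholog_groups, parental_genes, species):
    """Count 1:1 orthologous groups whose member gene is a circRNA
    parental gene in every species listed in `species`.

    ortholog_groups: list of dicts mapping species name -> gene ID
    parental_genes:  dict mapping species name -> set of parental gene IDs
    """
    shared = 0
    for group in ortholog_groups:
        # A group counts only if it has an ortholog in every species and
        # each of those orthologs gives rise to at least one circRNA.
        if all(sp in group and group[sp] in parental_genes[sp] for sp in species):
            shared += 1
    return shared


# Toy data with made-up gene IDs (not real Ensembl identifiers).
groups = [
    {"mouse": "m1", "rat": "r1", "human": "h1"},
    {"mouse": "m2", "rat": "r2", "human": "h2"},
]
parental = {"mouse": {"m1", "m2"}, "rat": {"r1"}, "human": {"h1", "h2"}}
n_shared = count_shared_parental_genes(groups, parental, ["mouse", "rat", "human"])
```

Restricting `species` to a subset (e.g. only the rodents or only the primates) gives the corresponding lineage-level counts.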

Level 2 - circRNA locus: To identify shared circRNA loci, all circRNA exon coordinates from a given gene were collapsed into a single transcript using the bedtools merge option from the BEDTools toolset with default options. Next, we used liftOver to compare exons from the collapsed transcript between species. The minimal ratio of bases that need to overlap for each exon was set to 0.5 (-minMatch=0.5). Collapsed transcripts were defined as overlapping between different species if they shared at least one exon, independent of the exon length.
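The logic of the Level 2 definition can be sketched in a few lines. This is an illustration only, not the pipeline used in the study: it assumes half-open BED-style coordinates, a single chromosome and strand, and exon coordinates that have already been converted with liftOver. The function names are hypothetical.

```python
def merge_intervals(exons):
    """Collapse (start, end) exon intervals into a non-redundant list,
    merging overlapping or directly adjacent intervals."""
    merged = []
    for start, end in sorted(exons):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of opening a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def shares_exon(lifted_exons, target_exons):
    """True if at least one lifted exon overlaps a target exon by >= 1 bp,
    independent of exon length."""
    return any(
        min(e1, e2) > max(s1, s2)
        for s1, e1 in lifted_exons
        for s2, e2 in target_exons
    )


collapsed = merge_intervals([(100, 200), (150, 300), (400, 500)])
is_shared = shares_exon([(290, 320)], collapsed)
```

The 0.5 minimum match ratio is applied by liftOver itself during coordinate conversion, so it does not appear in this sketch.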

Level 3 - start/stop exon: To identify circRNAs sharing the same first and last exon between species, we lifted exon coordinates between species (same settings as described above, liftOver, -minMatch=0.5). The circRNA was then defined as ‘shared’ if both exons were annotated as start and stop exons in the respective circRNAs of the given species. Note that this definition only requires an overlap for start and stop exons; internal circRNA exons may differ.
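A minimal sketch of the Level 3 check follows. It assumes that each circRNA is represented as a list of exon intervals ordered along the transcript, that coordinates of one species have already been lifted to the other genome, and that "the same exon" is approximated by an overlap of at least one base. These representation choices and the function names are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return min(a[1], b[1]) > max(a[0], b[0])


def same_start_stop(lifted_circ, target_circ):
    """Compare only the first (start) and last (stop) exon of two circRNAs;
    internal exons are deliberately ignored."""
    return overlaps(lifted_circ[0], target_circ[0]) and overlaps(
        lifted_circ[-1], target_circ[-1]
    )


lifted = [(100, 200), (300, 350), (500, 600)]
two_exon_match = [(110, 200), (500, 590)]      # different internal structure
shorter_circ = [(110, 200), (300, 350)]        # stops at an earlier exon
shared_a = same_start_stop(lifted, two_exon_match)
shared_b = same_start_stop(lifted, shorter_circ)
```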

Given that only circRNAs that comprise corresponding (1:1 orthologous) exons in different species can reasonably be considered potentially homologous (i.e. might have originated from evolutionary precursors in common ancestors), and that the Level 3 definition might require strong evolutionary conservation of splice sites (i.e. with this stringent definition many shared loci may be missed), we decided to use the Level 2 definition (circRNA locus) for the analyses presented in the main text, while still providing the results for the Level 1 and 3 definitions in the supplement (Figure 2—figure supplement 1A). Importantly, defining shared circRNA loci at this level also allows us to compare circRNA hotspots, which have been defined using a similar classification strategy.

Based on the species set in which shared circRNA loci were found, we categorised circRNAs into the following groups: species-specific, rodent, primate, eutherian, and therian circRNAs. To be part of the rodent or primate group, the circRNA has to be expressed in both species of the lineage. To be part of the eutherian group, the circRNA has to be expressed in three of the four species mouse, rat, rhesus macaque and human. To be part of the therian group, the circRNA needs to be expressed in opossum and in three of the four other species. Species-specific circRNAs are either present in one species or do not match any of the other four categories. The use of multiple species for defining shared loci allowed us to define ‘mammalian circRNAs’ with high confidence (Figure 2—figure supplement 1B).

To define the different groups, we used the cluster algorithm MCL (Enright et al., 2002; Dongen, 2000). MCL is frequently used to reconstruct orthology clusters based on blast results. It requires input in abc format (file: species.abc), in which a and b are the two connected events and c is a numeric value that provides information on the connection strength between them (e.g. blast p-value). If no p-values are available, as in this analysis, the connection strength can be set to 1. MCL was run with a cluster granularity of 2 (option -I).
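The abc input described above could be produced along the following lines. This is a hedged sketch: the circRNA labels and the function name `write_abc` are hypothetical, and the pairs stand for circRNA loci found to be shared between two species.

```python
import io


def write_abc(pairs, handle, weight=1):
    """Write one 'a<TAB>b<TAB>c' line per pair of shared circRNA loci.
    The connection strength c is fixed to `weight` because no p-values exist."""
    for a, b in pairs:
        handle.write(f"{a}\t{b}\t{weight}\n")


# Made-up labels; each pair is one shared locus between two species.
pairs = [("mouse_circ1", "rat_circ7"), ("rat_circ7", "human_circ3")]
buffer = io.StringIO()
write_abc(pairs, buffer)
abc_text = buffer.getvalue()
```

For a real run, the in-memory buffer would be replaced by a file handle such as `open("species.abc", "w")`.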

$ mcxload -abc species.abc --stream-mirror -o species.mci -write-tab species.tab
$ mcl species.mci -I 2
$ mcxdump -icl out.species.mci.I20 -tabr species.tab -o dump.species.mci.I20
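Once clusters are available, each can be assigned to one of the five groups. The sketch below is one possible reading of the rules stated above, not the authors' script. It assumes that the dump file holds one cluster per line with whitespace-separated labels, that labels follow a hypothetical `species:locus_id` format, that "three out of four" means at least three, and that the rodent and primate groups require exactly the two species of that lineage.

```python
RODENTS = {"mouse", "rat"}
PRIMATES = {"macaque", "human"}
EUTHERIANS = RODENTS | PRIMATES


def classify_cluster(labels):
    """Assign a group to one cluster of shared circRNA loci."""
    species = {label.split(":", 1)[0] for label in labels}
    if len(species & EUTHERIANS) >= 3:
        # Presence in opossum on top of three eutherians makes it therian.
        return "therian" if "opossum" in species else "eutherian"
    if species == RODENTS:
        return "rodent"
    if species == PRIMATES:
        return "primate"
    # Single-species clusters and any other combination.
    return "species-specific"


# Toy lines in the assumed dump format (made-up labels).
dump_lines = [
    "mouse:c1\trat:c7\thuman:c3\n",
    "mouse:c2\trat:c9\n",
    "opossum:c4\n",
    "opossum:c5\tmouse:c6\trat:c8\tmacaque:c2\n",
]
groups = [classify_cluster(line.split()) for line in dump_lines]
```

Under these assumptions a cluster such as opossum plus mouse falls into the species-specific group, because it matches none of the other four categories.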

Coding exons were selected based on the attribute ‘transcript_biotype = protein_coding’ in the gtf annotation file of the respective species and labelled as circRNA exons if they were in our circRNA annotation. Exons were further classified into UTR-exons and non-UTR exons using the Ensembl field ‘feature = exon’ or ‘feature = UTR’. Since conservation scores are generally lower for UTR-exons (Pollard et al., 2010), any exon labelled as UTR-exon was removed from further analyses to avoid bias when comparing circRNA and non-circRNA exons. Genomic coordinates of the remaining exons were collapsed using the merge command from the BEDTools toolset (bedtools merge input_file -nms -scores collapse) to obtain a list of unique genomic loci. PhastCons scores for all exon types were calculated using the conservation scores provided by the UCSC genome browser (mouse: phastCons scores based on alignment for 60 placental genomes; rat: phastCons scores based on alignment for 13 vertebrate genomes; human: phastCons scores based on alignment for 99 vertebrate genomes). For each gene type (parental or non-parental), the median phastCons score was calculated for each exon type within the gene (if non-parental: median of all exons; if parental: median of exons contained in the circRNA and median of exons outside of the circRNA).
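The per-gene summary described above can be sketched as follows. The example assumes that one phastCons value per exon (for instance the mean over its bases) has already been extracted from the UCSC tracks; how the per-exon value was derived is not stated here, so that step, the exon type labels and the function name are assumptions.

```python
from collections import defaultdict
from statistics import median


def median_scores(exon_scores):
    """exon_scores: iterable of (gene_id, exon_type, score) tuples.
    Returns {(gene_id, exon_type): median score over that gene's exons}."""
    grouped = defaultdict(list)
    for gene_id, exon_type, score in exon_scores:
        grouped[(gene_id, exon_type)].append(score)
    return {key: median(values) for key, values in grouped.items()}


# Made-up scores: geneA is a parental gene, geneB a non-parental gene.
records = [
    ("geneA", "circ", 1.0),
    ("geneA", "circ", 0.5),
    ("geneA", "non_circ", 0.25),
    ("geneB", "all", 0.25),
    ("geneB", "all", 0.5),
    ("geneB", "all", 0.75),
]
medians = median_scores(records)
```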

Using the DEXSeq package (from HTSeq 0.6.1), reads mapping to coding exons of the parental genes were counted. The exon bins defined by DEXSeq (filtered for bins >=10 nt) were then mapped and translated onto the different exon types: UTR-exons of parental genes, exons of parental genes that are not in a circRNA, and circRNA exons. For each exon type, an FPKM value based on the exon length and sequencing depth of the library was calculated.

Exons were labelled as expressed in a tissue if the calculated FPKM was at least 1. The maximum number of tissues in which each exon occurred was plotted separately for UTR-exons, exons outside the circRNA and exons contained in it.
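The expression filter can be illustrated with a short sketch. It assumes the standard FPKM formula (reads per kilobase of exon per million mapped reads), since the exact formula is not spelled out here; the tissue names, counts and function names are made up for the example.

```python
def fpkm(read_count, exon_length_bp, total_mapped_reads):
    """Standard FPKM: reads scaled to 1 kb of exon and 1 million mapped reads."""
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


def tissues_expressed(fpkm_by_tissue, threshold=1.0):
    """Number of tissues in which an exon reaches the FPKM threshold."""
    return sum(1 for value in fpkm_by_tissue.values() if value >= threshold)


# One 500 bp exon in three hypothetical libraries of 20 million reads each.
exon_fpkm = {
    "liver": fpkm(50, 500, 20_000_000),
    "cerebellum": fpkm(5, 500, 20_000_000),
    "testis": fpkm(10, 500, 20_000_000),
}
n_tissues = tissues_expressed(exon_fpkm)
```

Here the exon counts as expressed in liver (FPKM 5) and testis (FPKM 1, exactly at the threshold) but not in cerebellum (FPKM 0.5).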

The Ensembl annotation for each species was used to retrieve the different known transcripts in each coding gene. For each splice site, the GC amplitude was calculated using the last 250 intronic bp and the first 50 exonic bp (several values for the last n intronic bp and the first m exonic bp were tested beforehand; the 250:50 ratio was chosen because it gave the strongest signal). Splice sites were distinguished by their position relative to the circRNA (flanking, inside or outside). A one-tailed and paired Mann-Whitney U test was used to assess the difference in GC amplitude between circRNA-related splice sites and others.
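A possible implementation of the GC amplitude is sketched below. The text does not give the formula, so the sketch assumes that the amplitude is the GC fraction of the exonic window minus the GC fraction of the intronic window, with both sequences given 5' to 3' on the transcript strand and the intron lying upstream of the exon. The function names are hypothetical.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_amplitude(intron_seq, exon_seq, n_intron=250, n_exon=50):
    """Assumed definition: GC content of the first n_exon exonic bases
    minus GC content of the last n_intron intronic bases."""
    return gc_fraction(exon_seq[:n_exon]) - gc_fraction(intron_seq[-n_intron:])


# Extreme toy sequences: AT-only intron followed by a GC-only exon.
amplitude = gc_amplitude("AT" * 125, "GC" * 25)
```

Exposing the window sizes as parameters makes it straightforward to repeat the comparison of different n:m combinations mentioned above.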

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
