Filtration of intronic regions
Conserved long-range base pairings are associated with pre-mRNA processing of human genes
Nat Commun, Apr 16, 2021

Procedure

Intronic regions were defined as the longest continuous segments within genes that do not overlap any annotated exon, including exons of other genes. Each intronic region was extended by 10 nt into the flanking exons so that CCRs overlapping splice sites, such as regions R1–R5 in the human Ate1 gene [36], could be identified. These regions were then intersected with the set of conserved RNA elements (phastConsElements track for the alignment of 99 vertebrate genomes to the human genome [99]) using bedtools [100]. We additionally excluded genomic regions that overlap intervals from the UCSC Genome Browser tRNA track [101–105]; the sno/miRNA track [106–112]; the TFBS Conserved track, which consists of conserved transcription factor binding sites generated using the Transfac Matrix and Factor databases [113]; the RepeatMasker track [114] (all categories, including short interspersed nuclear elements (SINEs) and long interspersed nuclear elements (LINEs)); simple tandem repeats located by Tandem Repeats Finder [115]; and human nuclear mitochondrial sequences [116–118]. Genomic regions marked as “snRNA,” “miRNA,” “miscRNA,” “snoRNA,” “rRNA,” and “tRNA” in the GENCODE genome annotation were also excluded. The resulting set of conserved intronic regions (CIRs) was used as input for the next step, which is described below.
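The interval logic of this procedure can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which used bedtools on genome-wide annotation tracks. The sketch assumes half-open BED-style coordinates on a single chromosome and ignores strand; the function names (`merge`, `subtract`, `intersect`, `conserved_intronic_regions`) and the toy coordinates are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# Half-open [start, end) interval on one chromosome (an assumption of this sketch).
Interval = Tuple[int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Sort intervals and fuse any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if start >= end:
            continue  # skip empty intervals
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the parts of each interval in `a` not covered by any interval in `b`."""
    b_merged = merge(b)
    out: List[Interval] = []
    for start, end in a:
        cur = start
        for b_start, b_end in b_merged:
            if b_end <= cur:
                continue  # mask lies entirely to the left
            if b_start >= end:
                break  # mask lies entirely to the right
            if b_start > cur:
                out.append((cur, b_start))
            cur = max(cur, b_end)
            if cur >= end:
                break
        if cur < end:
            out.append((cur, end))
    return out


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the overlaps between each interval in `a` and the merged set `b`."""
    b_merged = merge(b)
    out: List[Interval] = []
    for a_start, a_end in a:
        for b_start, b_end in b_merged:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                out.append((start, end))
    return out


def conserved_intronic_regions(
    genes: List[Interval],
    exons: List[Interval],
    conserved: List[Interval],
    excluded: List[Interval],
    flank: int = 10,
) -> List[Interval]:
    """Sketch of the filtration: introns -> extend -> keep conserved -> drop excluded."""
    # 1. Intronic regions: gene segments that overlap no annotated exon.
    introns = subtract(genes, exons)
    # 2. Extend each intronic region by `flank` nt into the flanking exons.
    extended = [(max(0, s - flank), e + flank) for s, e in introns]
    # 3. Keep only the parts that overlap conserved elements.
    candidates = intersect(extended, conserved)
    # 4. Remove anything overlapping the excluded tracks (tRNA, repeats, ...).
    return subtract(candidates, excluded)


# Toy example with made-up coordinates.
genes = [(100, 1000)]
exons = [(100, 200), (500, 600), (900, 1000)]
conserved = [(180, 220), (300, 350), (880, 950)]
excluded = [(310, 320)]
cirs = conserved_intronic_regions(genes, exons, conserved, excluded)
```

In this toy example the conserved element at 180–220 is clipped to 190–220, because only 10 nt of the flanking exon are retained, and the excluded interval splits the element at 300–350 in two. A real analysis would also need to handle multiple chromosomes and to pool the exons of all genes before subtraction, as the procedure requires.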

