We classified gene models into three main types: protein-coding, pseudogene, and long non-coding RNA (lncRNA), based on the alignment quality of all supporting data for each model. Models that aligned to known proteins, had little or no overlap with repeat regions of the genome, and had high intron support and well-characterized canonical splice junctions were classified as protein-coding. Pseudogenes were annotated by identifying genes that aligned to known proteins but either showed evidence of frame-shifting or were located in repeat regions of the genome. Single-exon models with a corresponding multi-exon copy elsewhere in the genome were classified as processed pseudogenes. Gene models that were generated from transcriptomic data (short and long reads), lacked any supporting protein evidence, and did not overlap a protein-coding locus were classified as lncRNA.
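
The classification rules above can be sketched as a small decision function. This is only an illustration, not the authors' pipeline: the `GeneModel` fields, the two numeric cutoffs, the order in which the rules are tested, and the `unclassified` fallback are all assumptions, since the text gives no thresholds and does not say how conflicting evidence was resolved.

```python
from dataclasses import dataclass

# Hypothetical cutoffs: the source text gives no numeric thresholds.
MAX_REPEAT_OVERLAP = 0.10   # assumed meaning of "little or no overlap" with repeats
MIN_INTRON_SUPPORT = 0.90   # assumed meaning of "high intron support"


@dataclass
class GeneModel:
    """Evidence summary for one gene model (field names are illustrative)."""
    aligns_to_known_protein: bool = False
    repeat_overlap: float = 0.0            # fraction of the model inside repeats
    intron_support: float = 0.0            # fraction of introns with support
    canonical_splice_junctions: bool = False
    has_frameshift: bool = False
    exon_count: int = 1
    has_multi_exon_copy: bool = False      # multi-exon copy elsewhere in the genome
    from_transcriptomic_data: bool = False
    overlaps_protein_coding_locus: bool = False


def classify(model: GeneModel) -> str:
    """Assign a biotype label; the rule order is an assumption."""
    in_repeat = model.repeat_overlap > MAX_REPEAT_OVERLAP

    # Single-exon model with a multi-exon copy elsewhere.
    if model.exon_count == 1 and model.has_multi_exon_copy:
        return "processed_pseudogene"

    # Protein alignment, but frame-shifted or sitting in a repeat region.
    if model.aligns_to_known_protein and (model.has_frameshift or in_repeat):
        return "pseudogene"

    # Protein alignment with well-supported, canonical introns.
    if (model.aligns_to_known_protein
            and model.intron_support >= MIN_INTRON_SUPPORT
            and model.canonical_splice_junctions):
        return "protein_coding"

    # Transcript-derived, no protein evidence, no overlap with a coding locus.
    if (model.from_transcriptomic_data
            and not model.aligns_to_known_protein
            and not model.overlaps_protein_coding_locus):
        return "lncRNA"

    return "unclassified"
```

For example, `classify(GeneModel(from_transcriptomic_data=True, exon_count=2))` returns `"lncRNA"` under these assumed rules.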

Small non-coding RNA identification: Small non-coding RNA (sncRNA) genes were added using annotations taken from Rfam [75] and miRBase [76]. BLAST [77] was run with these sequences to identify homologs in the genome sequence, and the resulting models were evaluated for expected stem-loop structures using RNAfold [78]. Additional machine learning-based filters were applied to exclude predictions with sub-optimal alignments to the genome and non-conforming secondary structures. For other sncRNAs, models were built using the Infernal software suite [79].
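
As a rough illustration of what checking for an expected stem-loop might involve, the sketch below tests whether a predicted secondary structure in dot-bracket notation (the format RNAfold reports) forms a single hairpin. It is not the authors' filter, which the text describes only as machine learning-based; the function name `is_single_stem_loop` and the `min_stem_pairs` default are assumptions made for this example.

```python
def is_single_stem_loop(structure: str, min_stem_pairs: int = 5) -> bool:
    """Return True if a dot-bracket string describes one hairpin.

    A single stem-loop (bulges and internal loops allowed) has every
    opening bracket before every closing bracket. `min_stem_pairs` is a
    hypothetical cutoff, not a value taken from the source.
    """
    # Only plain dot-bracket characters are accepted in this sketch.
    if not structure or set(structure) - set("()."):
        return False

    pairs = structure.count("(")
    if pairs != structure.count(")") or pairs < min_stem_pairs:
        return False

    # One hairpin: the last '(' lies to the left of the first ')'.
    return structure.rfind("(") < structure.find(")")
```

A structure such as `((((((....))))))` passes this check, whereas two adjacent hairpins such as `((((((...))))))..((((((...))))))` do not, because an opening bracket appears after a closing one.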
