Seq2Fun phase II: translate reads and map to protein database

PL Peng Liu
JE Jessica Ewald
JG Jose Hector Galvez
JH Jessica Head
DC Doug Crump
GB Guillaume Bourque
NB Niladri Basu
JX Jianguo Xia
request Request a Protocol
ask Ask a question
Favorite

First, each clean read is translated into amino acid sequences using six reading frames from both directions, which typically results in dozens of peptide fragments. At most, the top six longest fragments are kept for the translated search in MEM mode, although it could generate more fragments with the BLOSUM62 scores (Henikoff and Henikoff 1992) for Greedy mode (Supplemental Fig. S1) but still far fewer than the original algorithm implemented in Kaiju v1.7.3 (Menzel et al. 2016). Although keeping the top longest fragments could remove some true protein fragments, the same methods are applied to all samples and therefore should have a minimal impact on any downstream comparative analysis such as differential expression analysis. In some cases, only keeping the longest fragments could filter out a proportion of fragments that originated from the merged PE reads if start and stop codons are present in the middle of the reads. Therefore, we recap the maximum cutoff length of peptide fragment to be 60 aa (by default though, the user can change this cutoff), which will prevent the filtering out of some true peptide fragments from the merged reads.

Next, the peptide fragments are aligned to the Seq2Fun database, which consists of protein sequences from KEGG pathway genes that were retrieved using the KEGGREST R package v1.12.2 (https://bioconductor.org/packages/KEGGREST/). The size of the protein database was reduced by removing redundant protein sequences that have >99% similarity across species using CD-HIT v4.8.1 (Supplemental Tables S1, S2; Li and Godzik 2006; Fu et al. 2012). Seq2Fun employs the same core reads alignment algorithm as Kaiju v1.7., which has two different modes designed for species with (MEM mode) and without (Greedy mode) a reference genome (Menzel et al. 2016). The MEM mode only allows exact matches between query and subject sequences from the database. It enables a fast search in the database, and the fragment with the longest matching length is retained. It is designed for organisms with reference genomes in the database. Therefore, in this study, MEM mode was used to map RNA-seq reads from mouse, chicken, zebrafish, and roundworm to their own species-specific protein references (e.g., the mouse database for MEM mode consists of 8438 mouse-specific protein sequences) (Supplemental Tables S1, S2), respectively, in order to demonstrate the feasibility of MEM mode. The downside of MEM mode is that it cannot identify homologous protein sequences of the query if there is even a single discrepancy with the sequences in the database. Therefore, the Greedy mode is introduced to allow a small number of amino acid mismatches between the query and subject, which helps Seq2Fun handle evolutionary divergence between species. The peptide fragments are aligned to a database that contains sequences from many different species, and the fragments with the highest BLOSUM62 scores (Henikoff and Henikoff 1992) are retained. A detailed description of the MEM and Greedy modes is available in the Supplemental Materials. In this study, four databases with genome exclusion for Greedy mode were created for mouse (e.g., 64 mammal species excluding mouse), chicken, zebrafish, and roundworm, respectively (Table 1; Supplemental Tables S1, S2), to mimic conditions for analyzing data from an organism without a reference genome. This phase produces a reads-protein ID map.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A