Motif discovery from mutation effect predictions

Joseph D. Valencia
David A. Hendrix
Zhaolei Zhang
Jian Ma

To uncover sequence elements salient to bioseq2seq predictions, we converted ISM scores into importance scores for the endogenous bases. Specifically, we set the importance score of an endogenous base with respect to a given class equal to the absolute value of ΔS for the strongest mutation in the direction of the counterfactual class, following [82], which used the equivalent quantity from regression models to visualize importance. For example, an endogenous base x_i within an mRNA was defined as contributing towards a true positive classification of 〈PC〉 to the extent that substituting any of the three alternate bases at position i produces a highly negative ΔS, which pushes the prediction towards a false negative of 〈NC〉. We calculated importance with respect to both classes on all transcripts. For instance, we looked for strong local contributions towards a prediction of 〈PC〉 within annotated lncRNAs.
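
The conversion from ISM scores to per-base importance can be sketched in Python as follows. This is a minimal illustration rather than the bioseq2seq implementation: it assumes ΔS is stored as an L × 4 array (a hypothetical layout, columns ordered A, C, G, U) in which negative values push the prediction towards 〈NC〉 and positive values towards 〈PC〉. The function name `importance_from_ism` is hypothetical, and assigning zero importance when no substitution moves the prediction towards the counterfactual class is an assumption, since the text does not address that case.

```python
import numpy as np

BASES = "ACGU"  # assumed column order of the delta-S array


def importance_from_ism(seq, delta_s, target_class="PC"):
    """Per-position importance of the endogenous bases for one class.

    seq          : RNA sequence of length L (T is treated as U)
    delta_s      : (L, 4) array, delta_s[i, b] is the change in score when
                   position i is substituted with BASES[b]
    target_class : "PC" or "NC"; the counterfactual is the other class
    """
    seq = seq.upper().replace("T", "U")
    delta_s = np.asarray(delta_s, dtype=float)
    if delta_s.shape != (len(seq), len(BASES)):
        raise ValueError("delta_s must have shape (len(seq), 4)")

    rows = np.arange(len(seq))
    endogenous = np.array([BASES.index(b) for b in seq])
    masked = delta_s.copy()

    if target_class == "PC":
        # Counterfactual is NC: take the most negative substitution.
        masked[rows, endogenous] = np.inf  # ignore the endogenous base
        strongest = masked.min(axis=1)
        # Assumption: importance is zero if no substitution is negative.
        return np.abs(np.minimum(strongest, 0.0))
    if target_class == "NC":
        # Counterfactual is PC: take the most positive substitution.
        masked[rows, endogenous] = -np.inf
        strongest = masked.max(axis=1)
        return np.maximum(strongest, 0.0)
    raise ValueError("target_class must be 'PC' or 'NC'")
```

Calling the function once with `target_class="PC"` and once with `target_class="NC"` on every transcript gives the two importance settings used in the text.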

For a given importance setting, we then extracted a 21-nt window centered on the position with the highest importance score, spanning 10 nt upstream and 10 nt downstream. This process was run separately for mRNA 5′ UTRs, CDSs, and 3′ UTRs, and similarly for lncRNAs using the longest ORF and its upstream and downstream regions. We used the STREME motif discovery tool to efficiently identify sequence motifs occurring frequently in these regions of interest [57]. STREME estimates p-values for motifs; after collecting all discovered motifs, we reported those that were significant at the 0.001 level after Bonferroni correction for multiple testing.
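
The window extraction and the significance filter can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names `peak_window` and `bonferroni_significant` are hypothetical, skipping peaks that lie within 10 nt of a region boundary is an assumption (the text does not say how truncated windows were handled), and the number of tests used for the Bonferroni correction is taken to be the number of collected motifs unless given explicitly.

```python
import numpy as np


def peak_window(seq, importance, flank=10):
    """Return the (2 * flank + 1)-nt window centered on the top-scoring base.

    Returns None when the window would run past either end of the region
    (assumption: such peaks are skipped rather than padded or clamped).
    """
    importance = np.asarray(importance, dtype=float)
    if importance.shape != (len(seq),):
        raise ValueError("importance must have one score per base")
    peak = int(np.argmax(importance))
    start, end = peak - flank, peak + flank + 1
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def bonferroni_significant(pvalues, alpha=0.001, n_tests=None):
    """Flag p-values that stay below alpha after Bonferroni correction.

    n_tests defaults to the number of p-values supplied (assumption).
    """
    pvalues = list(pvalues)
    n = len(pvalues) if n_tests is None else n_tests
    # Bonferroni: multiply each p-value by the number of tests, cap at 1.
    return [min(1.0, p * n) < alpha for p in pvalues]
```

In use, `peak_window` would be applied separately to each region (for example a 5′ UTR, CDS, or 3′ UTR) with the importance scores for that region, and the resulting windows passed to the motif discovery tool.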
