The predicted mRNAs were first translated into protein sequences. The MMseqs2 (32) program suite was used for subsequent protein sequence analysis. Its cluster submodule was used to cluster the sequences and reduce redundancy in sequence space, with the e-value threshold set to 1e−5. Representative sequences of the clusters were then compared with sequences available in structure databases (i.e. PDB (52, 53) and the AlphaFold Protein Structure Database (31)) using the search submodule, with the e-value threshold also set to 1e−5. As a result, 6,800 nonhomologous protein sequences remained for structural prediction.
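The clustering and search steps can be sketched as a short series of MMseqs2 calls. The following is a minimal illustration, not the exact commands used in this work: the file and database names are hypothetical, the target structure database is assumed to have been converted to MMseqs2 format beforehand, and the final filter assumes that representatives with a database hit were the ones excluded.

```python
E_VALUE = "1e-5"  # e-value threshold stated in the text


def build_mmseqs_commands(proteins_fasta, structure_db, workdir="work"):
    """Return MMseqs2 command lines as argument lists, without running them.

    All paths are hypothetical placeholders chosen for illustration.
    """
    db = f"{workdir}/proteins"
    clusters = f"{workdir}/clusters"
    reps = f"{workdir}/representatives"
    hits = f"{workdir}/hits"
    tmp = f"{workdir}/tmp"
    return [
        # Convert the translated protein FASTA into an MMseqs2 database.
        ["mmseqs", "createdb", proteins_fasta, db],
        # Cluster to reduce redundancy.
        ["mmseqs", "cluster", db, clusters, tmp, "-e", E_VALUE],
        # Extract one representative sequence per cluster.
        ["mmseqs", "createsubdb", clusters, db, reps],
        # Search representatives against the structure database sequences.
        ["mmseqs", "search", reps, structure_db, hits, tmp, "-e", E_VALUE],
        # Export hits as a tab-separated table (query ID in the first column).
        ["mmseqs", "convertalis", reps, structure_db, hits, f"{workdir}/hits.m8"],
    ]


def representatives_without_hits(representative_ids, m8_lines):
    """Keep representatives that never appear as a query in the hit table.

    Assumption: sequences with a hit are treated as already covered by the
    structure databases and are therefore set aside.
    """
    queries_with_hits = {
        line.split("\t", 1)[0] for line in m8_lines if line.strip()
    }
    return [rid for rid in representative_ids if rid not in queries_with_hits]
```

Each command list can be passed to `subprocess.run(cmd, check=True)` on a machine where MMseqs2 is installed.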
A non-Docker version of AlphaFold2 (30) was deployed for speed and scalability. The features needed as input for prediction (i.e. multiple sequence alignments) were first generated on a distributed cluster of machines without GPUs. Structural prediction by the neural network and refinement using molecular dynamics were then both conducted on machines with graphics cards, with one graphics card assigned to each task to speed up computation. In total, we obtained 6,798 predicted structures and their related information; prediction failed for the remaining two sequences because of video memory limitations.
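The "one graphics card per task" arrangement can be illustrated with a small scheduling sketch. The scheduler actually used is not described here, so the round-robin assignment and both function names below are assumptions. The only external convention relied on is the CUDA environment variable `CUDA_VISIBLE_DEVICES`, which restricts a process to the listed GPU.

```python
import os


def assign_gpus(targets, n_gpus):
    """Distribute prediction targets over GPUs in round-robin order.

    Returns a dict mapping each GPU index to its queue of targets, so that
    every running task has exactly one GPU to itself.
    """
    if n_gpus < 1:
        raise ValueError("at least one GPU is required")
    queues = {gpu: [] for gpu in range(n_gpus)}
    for i, target in enumerate(targets):
        queues[i % n_gpus].append(target)
    return queues


def task_environment(gpu_index, base_env=None):
    """Build the environment for a task pinned to a single GPU."""
    env = dict(os.environ if base_env is None else base_env)
    # Expose only the chosen device to the prediction process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env
```

The returned environment can be passed as `env=` to `subprocess.run` when launching the prediction command for each target. Because feature generation needs no GPU, it can run in advance on CPU-only machines.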
To identify the superfamily of each predicted structure, we used DaliLite.v5 (54), a standalone program for protein structural alignment using the Dali method, to compare the predicted structures with the representative structures of superfamilies provided by SCOPe (55, 56). The all-against-all structural comparisons were performed with default parameters. For each query, the hit with the highest Z-score was taken as the best hit, and the query protein was assigned to the same superfamily as that hit.
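The best-hit rule can be expressed in a few lines. This sketch assumes the Dali results have already been parsed into (query, SCOPe sccs identifier, Z-score) records; the `Hit` type and function names are hypothetical, and no parsing of the actual DaliLite output files is shown. It relies on the SCOPe convention that an sccs identifier has the form class.fold.superfamily.family, so the first three fields name the superfamily.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str      # ID of the predicted structure
    sccs: str       # SCOPe identifier of the matched representative, e.g. "c.37.1.8"
    z_score: float  # Dali Z-score of the structural alignment


def superfamily_of(sccs):
    """Truncate an sccs identifier to class.fold.superfamily."""
    return ".".join(sccs.split(".")[:3])


def assign_superfamilies(hits):
    """Assign each query the superfamily of its highest-Z-score hit."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.z_score > current.z_score:
            best[hit.query] = hit
    return {query: superfamily_of(hit.sccs) for query, hit in best.items()}
```

When two hits for a query share the same Z-score, this sketch keeps the one encountered first; how such ties were handled in practice is not stated.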