Published: Vol 10, Iss 9, May 5, 2020 DOI: 10.21769/BioProtoc.3600 Views: 5054
Reviewed by: Prashanth N Suravajhala, Jayaraman Valadi, L N Chavali
Abstract
Template-based modeling, the process of predicting the tertiary structure of a protein by using homologous protein structures, is useful when good templates are available. Indeed, modern homology detection methods can find remote homologs with high sensitivity. However, the accuracy of template-based models generated from homology-detection-based alignments is often lower than that of models generated from ideal alignments. In this study, we propose a new method that generates pairwise sequence alignments for more accurate template-based modeling. Our method trains a machine learning model using the structural alignments of known homologs. When calculating sequence alignments, instead of using a fixed substitution matrix, this method dynamically predicts a substitution score from the trained model.
Background
Proteins are key molecules in biology, biochemistry and pharmaceutical sciences. To reveal the functions of proteins, it is essential to understand the relationship between protein structure and function. Protein structures can be determined experimentally; these structures are often deposited in, and accessible from, the Protein Data Bank (PDB) (wwPDB consortium, 2018). However, despite improvements in experimental methods for determining protein structures, the speed at which amino acid sequences can be revealed has overtaken our ability to ascertain the corresponding proteins' structures (Muhammed et al., 2019). Therefore, protein structure prediction remains essential.
Among the various methods for protein structure prediction, template-based modeling (also called homology modeling) predicts structures based on templates and their sequence alignment to a target protein. Template structures are the structures of homologous proteins, often found by homology detection methods. Currently, template-based modeling methods are the most practical because the predicted models are often accurate if good templates and protein sequence alignments can be found. Accurate models produced by template-based modeling can be used for computer-aided drug design (CADD).
Indeed, recent homology search methods have been able to detect remote homologs (Boratyn et al., 2012; Zimmermann et al., 2018). However, sufficiently accurate structure models sometimes cannot be obtained because the quality of the sequence alignment generated by the homology detection program is poor. If a more accurate model is required, researchers must manually edit alignments to improve their quality before modeling. In structural alignment, the structural difference between a target protein structure and a template protein structure is minimized; thus, sequence alignments generated by structural alignment are almost ideal for template-based modeling. The sequence alignments generated by homology detection methods are often dissimilar to those generated by structural alignment, especially for remote homologs. Thus far, a method's ability to detect remote homologs has been prioritized because models cannot be generated without a template. However, to achieve higher-accuracy template-based modeling, improving sequence alignment generation is a critical open problem. This problem has been mentioned in several studies (Kopp et al., 2007) in which researchers have tried to improve alignments manually based on their knowledge of biology; fully automated methods are still required.
Recently, machine learning methods have proven powerful in various fields (Lyons et al., 2014; Cao et al., 2016; Wang, Peng, et al., 2016; Wei and Zou, 2016; Manavalan and Lee, 2017; Wang, Sun, et al., 2017). Machine learning also seems promising for tackling the problem of alignment generation for homology modeling. However, this topic has not been studied because it is challenging to treat alignment generation as a classification or regression problem.
To address this problem, we proposed a new sequence alignment generation protocol based on a machine learning model that learns the structural alignments of known homologs (Makigaki and Ishida, 2019). When aligning sequences with a dynamic programming algorithm, we dynamically predict each substitution score from a k-Nearest Neighbor (k-NN) model instead of using a fixed substitution matrix or profile comparison. Machine learning is used only in this substitution score prediction step.
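The idea of replacing a fixed substitution matrix with a per-position score can be illustrated with a minimal sketch. It assumes a global (Needleman-Wunsch style) alignment with a linear gap penalty, which may differ from the alignment variant and gap model used in the published tool. The names `align`, `identity_score`, and the `score(i, j)` callback are hypothetical and are not taken from the exmachina repository; `identity_score` is only a stand-in for a trained predictor.

```python
from typing import Callable, List, Tuple

ScoreFn = Callable[[int, int], float]


def align(a: str, b: str, score: ScoreFn, gap: float = -1.0) -> Tuple[float, str, str]:
    """Globally align a and b; score(i, j) is queried for residue pair (a[i], b[j])."""
    n, m = len(a), len(b)
    # dp[i][j] holds the best score for aligning a[:i] with b[:j].
    dp: List[List[float]] = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(i - 1, j - 1),  # pair a[i-1] with b[j-1]
                dp[i - 1][j] + gap,                      # gap in b
                dp[i][j - 1] + gap,                      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + score(i - 1, j - 1):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and (j == 0 or dp[i][j] == dp[i - 1][j] + gap):
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return dp[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def identity_score(a: str, b: str) -> ScoreFn:
    """Toy stand-in for a learned predictor: +1 for identical residues, -1 otherwise."""
    return lambda i, j: 1.0 if a[i] == b[j] else -1.0


if __name__ == "__main__":
    target, template = "ACGT", "AGT"
    print(align(target, template, identity_score(target, template)))
```

Because the recurrence only calls `score(i, j)`, any predictor that maps a pair of positions to a number, such as a trained k-NN model, can be substituted without changing the dynamic programming code.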
The proposed method is valuable for researchers who use template-based modeling with remote homologs whose sequence identity is not high. In this paper, we present an overview of our method as a procedure; more detailed usage of our tool and some examples are available in the source code repository (https://github.com/shuichiro-makigaki/exmachina).
Equipment
Software
Procedure
The method consists of a training phase and a prediction phase. The primary purpose of the training phase is to generate a k-NN model that will be used for substitution score prediction. The prediction phase consists of score prediction followed by alignment generation. Figure 1 shows an overview of the method. More detailed step-by-step commands and an example are available in the source code repository (https://github.com/shuichiro-makigaki/exmachina).
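The two phases can be sketched with a small brute-force k-NN scorer. This is an illustrative sketch under stated assumptions, not the implementation in the repository: it assumes each training example is a numeric feature vector describing one residue pair, labeled 1 if that pair is aligned in the structural alignment and 0 otherwise, and it takes the fraction of positive labels among the k nearest neighbors as the predicted substitution score. The class name `KnnScorer`, the feature layout, and the toy data are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


class KnnScorer:
    """Brute-force k-NN over labeled residue-pair feature vectors (hypothetical layout)."""

    def __init__(self, k: int = 3) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.samples: List[Tuple[Vector, int]] = []

    def fit(self, features: Sequence[Vector], labels: Sequence[int]) -> "KnnScorer":
        # Training phase: k-NN simply stores the labeled examples.
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.samples = list(zip(features, labels))
        return self

    def predict_score(self, query: Vector) -> float:
        # Prediction phase: fraction of the k nearest stored examples labeled 1.
        if not self.samples:
            raise ValueError("model has not been fitted")
        nearest = sorted(self.samples, key=lambda s: math.dist(s[0], query))[: self.k]
        return sum(label for _, label in nearest) / len(nearest)


if __name__ == "__main__":
    # Toy data (made up for illustration): label 1 means the residue pair
    # was aligned in a structural alignment, label 0 means it was not.
    train_x = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
    train_y = [1, 1, 1, 0, 0, 0]
    model = KnnScorer(k=3).fit(train_x, train_y)
    print(model.predict_score((0.05, 0.05)))
    print(model.predict_score((0.95, 0.95)))
```

In such a setup, the score returned by `predict_score` would be computed for each residue pair and used in place of a substitution matrix lookup during alignment generation. For large training sets, an indexed or approximate nearest-neighbor search would likely be needed instead of the linear scan shown here.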
Figure 1. Overview of the proposed method
Acknowledgments
This work was supported by JSPS KAKENHI [18K11524]. This protocol is based on our previous work (Makigaki and Ishida, 2019).
Competing interests
The authors declare no competing interests.
References
Article Information
Copyright
© 2020 The Authors; exclusive licensee Bio-protocol LLC.
How to cite
Makigaki, S. and Ishida, T. (2020). Sequence Alignment Using Machine Learning for Accurate Template-based Protein Structure Prediction. Bio-protocol 10(9): e3600. DOI: 10.21769/BioProtoc.3600.
Category
Systems Biology > Transcriptomics > RNA-seq
Biochemistry > Protein > Structure