Last updated date: Feb 23, 2022
Protocol for “Annotating Putative D. discoideum Proteins Using I-TASSER”
Ryan Rahman
Texas A&M University
The objective of this protocol is to provide a tool (one of many) that can be implemented to characterize the functions of poorly understood gene products that have no gene ontology (GO) annotation. One such method implemented by the lab of Dr. Richard Gomer is inputting protein FASTA sequences into a comparative suite of bioinformatic programs called I-TASSER, which was developed and is currently maintained by the Zhang lab at the University of Michigan.
Important Links:
Stepwise protocol:
Identifying candidate genes/gene products
4. A prerequisite for I-TASSER is the assumption that the genome of the model organism of choice has been sequenced and these data are readily available. As an example, I will illustrate the steps for Dictyostelium discoideum.
5. Navigate to the model organism database (Dictybase.org for D. discoideum) and identify unannotated genes (either those of interest obtained through proteomic, transcriptomic, or genomic data or by contacting the curators of said database).
b. Verify that the gene of interest is unannotated in the database (DictyBase) and perform a BLASTp (protein-to-protein) search. BLAST, the Basic Local Alignment Search Tool, compares the input sequence to all those stored by NCBI to find those that share the most sequence similarity (read more about the exact details of how BLAST works). A strong hit with high similarity has a very small E-value (which is similar to a p-value in that the lower it is, the more statistically significant the result is). If a hit is well characterized in another organism, then an annotation can already be written based on the functional information about that gene product, which is likely an ortholog of your protein. If there is a low E-value hit, or multiple hits, but all the hits are ‘hypothetical protein’, then you are looking at a protein of unknown function in Dicty that does need to be investigated using I-TASSER. Here is the link to NCBI BLASTp.
6. Include your top 5 BLAST hits in the final annotation of your gene/gene product (if the E-value is below the widely accepted threshold value of 0.01). If BLAST yields very few results, or just hypothetical proteins, or more information is desired, proceed with I-TASSER using the following website: I-TASSER server for protein structure and function prediction (zhanggroup.org).
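The decision described in the BLAST steps above can be sketched in a few lines of Python. This is only an illustration, assuming the hits have already been copied out of the BLASTp results by hand; the `BlastHit` fields and the function names are hypothetical and are not part of any NCBI tool.

```python
from typing import NamedTuple

E_VALUE_CUTOFF = 0.01  # threshold quoted in the protocol


class BlastHit(NamedTuple):
    # Hypothetical record layout, filled in by hand from the BLASTp output.
    accession: str
    description: str
    e_value: float


def top_hits(hits, cutoff=E_VALUE_CUTOFF, limit=5):
    """Return up to `limit` hits with an E-value below `cutoff`, best first."""
    significant = [h for h in hits if h.e_value < cutoff]
    return sorted(significant, key=lambda h: h.e_value)[:limit]


def needs_itasser(hits, cutoff=E_VALUE_CUTOFF):
    """True when no significant hit carries a functional description."""
    significant = [h for h in hits if h.e_value < cutoff]
    informative = [
        h for h in significant
        if "hypothetical protein" not in h.description.lower()
    ]
    return not informative
```

A gene whose significant hits are all labeled "hypothetical protein" (or that has no significant hits at all) makes `needs_itasser` return `True`, which corresponds to proceeding with I-TASSER.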
7. To use this resource, registration is required and can take a couple of days to process (it is free of charge, as long as you register with a .edu email address). After receiving the confirmation email and password for your account, log in on the home page to submit the sequence for your protein of interest.
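Before pasting a sequence into the submission form, it can help to check that the FASTA record is well formed. The following is a minimal sketch, not an official I-TASSER check; the function names are hypothetical, and the server may apply its own rules that are not reproduced here.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta_record(text):
    """Split a single-record FASTA string into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA record must start with a '>' header line")
    header = lines[0][1:].strip()
    sequence = "".join(lines[1:]).upper()
    return header, sequence


def nonstandard_residues(sequence):
    """Return the sorted characters that are not one of the 20 standard residues."""
    return sorted(set(sequence) - STANDARD_AMINO_ACIDS)
```

An empty list from `nonstandard_residues` means the sequence contains only the 20 standard one-letter codes; anything else (for example `X` or a trailing `*`) is worth reviewing before submission.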
c. With the use of the best threading templates, several SPICKER cluster simulations are run to create the top 5 three-dimensional models which are then given C-scores. “C-score is a confidence score for estimating the quality of predicted models by I-TASSER. It is calculated based on the significance of threading template alignments and the convergence parameters of the structure assembly simulations. C-score is typically in the range of [-5,2], where a C-score of higher value signifies a model with a high confidence and vice-versa.”
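Since a higher C-score signifies higher confidence, the five models can be ranked with a simple sort. The sketch below assumes the C-scores have been typed in from the results page; the function names are hypothetical, and the range check only flags values outside the typical [-5, 2] interval quoted above.

```python
C_SCORE_RANGE = (-5.0, 2.0)  # typical range quoted from the I-TASSER output


def rank_models(c_scores):
    """Return (model_name, c_score) pairs, highest confidence first."""
    return sorted(c_scores.items(), key=lambda item: item[1], reverse=True)


def outside_typical_range(c_scores, bounds=C_SCORE_RANGE):
    """Return the models whose C-score falls outside the typical range."""
    low, high = bounds
    return {name: s for name, s in c_scores.items() if not low <= s <= high}
```

A value reported by `outside_typical_range` is not necessarily wrong, since the range is described as typical, but it is a cue to re-check the number against the output page.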
d. The most informative and functional data given by I-TASSER are displayed in the following sections of the example output: “Proteins structurally close to the target in the PDB” and “Predicted function using COFACTOR and COACH”.
i. The first subsection illustrates the best final I-TASSER model (“Model 1” from above), which I-TASSER compares to all X-ray crystallography structures in the PDB using TM-align. These are the proteins with the most structural similarity to your gene product of interest. The proteins with the highest TM-scores are therefore the most likely to share a function with your gene product, since it is widely accepted that form follows function, especially when it comes to proteins. Click on the hyperlinks to the PDB hits and note the predicted functions and protein families/classes of the top five hits, provided these hits have a TM-score > 0.50 (the accepted cutoff for TM-scores).
ii. The final subsections include predictions of motifs such as ligand-binding sites, catalytic/active sites, and the gene ontology of your gene product of interest. Again, use the C-scores and GO-scores to assess which hits are accurate for your gene product. Information concerning the calculation of these statistical measures can be found in the Zhang lab’s 2015 Nature Methods paper, and a concise annotation of the example output is also available here. If you prefer an even more technical protocol about how to use I-TASSER, use this link.
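The TM-score cutoff of 0.50 for the structural analogs can be applied with a short filter. This is a sketch under the assumption that the PDB codes and TM-scores were copied from the “Proteins structurally close to the target in the PDB” table; `structural_analogs` is a hypothetical helper name.

```python
TM_SCORE_CUTOFF = 0.50  # cutoff stated in the protocol


def structural_analogs(hits, cutoff=TM_SCORE_CUTOFF, limit=5):
    """Keep (pdb_code, tm_score) pairs above `cutoff`, best first."""
    passing = [(code, tm) for code, tm in hits if tm > cutoff]
    return sorted(passing, key=lambda hit: hit[1], reverse=True)[:limit]
```

Only the hits returned by this filter would be carried forward into the annotation; an empty result means no structural analog met the cutoff.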
e. Lastly, please remember to cite the Zhang lab in all publications associated with I-TASSER:
i. You are requested to cite following articles when you use the I-TASSER server:
Final Annotations
Protein ID
Statistical cutoffs for I-TASSER
2. An example is provided here as a bulleted list for DDB_G0271132:
DDB_G0271132
Annotation:
○ Predicted secondary structure: 35-36 separate alpha helices, 7-8 beta strands, and the rest of the residues are associated with coils.
○ Most of the amino acid residues are buried, implying that they are likely hydrophobic and could be components of a transmembrane channel.
○ I-TASSER uses LOMETS to access the PDB library and perform complex multithreading to compare tens of thousands of template alignments to the query sequence. The top 5 threading templates are given by the codes: 6kzoA, 5fvmA, 6r9tA, 1vt4, and 1vt4A. These are listed in descending rank, with normalized Z-scores of 1.50, 1.68, 1.39, 1.38, and 1.17 respectively. Coverage values (calculated as number of structurally aligned residues divided by length of query) for each hit were found to be 0.94, 0.97, 0.98, 0.61, 0.17.
○ The top 5 protein structural analogs identified in the PDB have the codes 6kzoA, 6uz0A, 3ir7A, 1sijA, and 1dgjA, which correspond to a human voltage-gated calcium channel (membrane protein), a sodium channel (membrane protein) for cardiac action potentials, a transmembrane oxidoreductase, an oxidoreductase (aldehyde dehydrogenase), and an oxidoreductase, respectively.
○ The top 4 ligand binding sites are given by the ligand names MG, FES, ZN, and MG, which correspond to magnesium, an iron-sulfur cluster, zinc, and magnesium, with C-scores of 0.07, 0.02, 0.02, and 0.02 respectively.
○ The top 5 enzyme commission (active sites) PDB hits were given by the codes 2j5wA, 1g8kA, 1kgfA, 1dgjA, and 1h0hA with TM-scores of 0.270, 0.261, 0.253, 0.277, and 0.247 respectively in descending rank. These correspond to a human metal-cation binding oxidoreductase, an arsenite oxidase, a dehydrogenase, an aldehyde oxidoreductase, and a formate dehydrogenase electron transport protein respectively.
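The numbers in the example annotation can be tabulated to make them easier to compare. The sketch below uses only the values listed above for DDB_G0271132; the `coverage` helper restates the definition given in the annotation (aligned residues divided by query length), and `min_coverage` is a hypothetical parameter, since the protocol does not state a coverage cutoff. Applying the 0.50 TM-score cutoff to the enzyme commission hits is likewise an assumption, as the protocol states that cutoff for the structural analogs.

```python
def coverage(aligned_residues, query_length):
    """Number of structurally aligned residues divided by the query length."""
    if query_length <= 0:
        raise ValueError("query_length must be positive")
    return aligned_residues / query_length


# (template, normalized Z-score, coverage) as listed in the annotation
THREADING_TEMPLATES = [
    ("6kzoA", 1.50, 0.94),
    ("5fvmA", 1.68, 0.97),
    ("6r9tA", 1.39, 0.98),
    ("1vt4", 1.38, 0.61),
    ("1vt4A", 1.17, 0.17),
]

# (PDB code, TM-score) for the enzyme commission hits in the annotation
EC_HITS = [
    ("2j5wA", 0.270),
    ("1g8kA", 0.261),
    ("1kgfA", 0.253),
    ("1dgjA", 0.277),
    ("1h0hA", 0.247),
]


def well_covered(templates, min_coverage):
    """Return template codes whose coverage is at least `min_coverage`."""
    return [code for code, _z, cov in templates if cov >= min_coverage]


def above_tm_cutoff(hits, cutoff=0.50):
    """Return the hits whose TM-score exceeds `cutoff`."""
    return [(code, tm) for code, tm in hits if tm > cutoff]
```

With these values, `above_tm_cutoff(EC_HITS)` returns an empty list, so under that assumed cutoff the enzyme commission assignments for this protein would be treated as tentative.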