Although GO-based methods have been shown to perform very well in predicting subcellular locations, the approach has caused some controversy and confusion. If a protein has already been annotated with cellular component GO terms, why does its subcellular location need to be predicted? Is the procedure merely a conversion of the annotation from one format into another? Several facts help to answer these questions. All existing benchmark data sets for protein subcellular localization predictors were built from proteins in the Swiss-Prot database whose subcellular locations were determined experimentally. Does this mean that the outputs of these predictors are not predictions? No, it does not. For both GO and non-GO predictors, the input is a query protein sequence alone, with no GO information added, and the output is its subcellular location(s). In other words, as far as the input requirement is concerned, there is no difference at all between non-GO and GO-based predictors [53]. GO-based methods perform well because feature vectors in the GO space reflect subcellular locations better than those in the Euclidean space or any other simple geometric space [54]. Our previous work [33] also strongly supports the legitimacy of using GO information for subcellular localization prediction. Other studies [24, 55] have demonstrated that solving the prediction problem with a lookup table between cellular component GO terms and cellular component categories is undesirable and gives very poor prediction performance.
Following our previous work [33, 56], we first compress and reorganize the GO numbers in the GO database (released on 20 June 2015), because GO numbers are not consecutive. We map the GO numbers to GO_compress numbers and store the processed data in a new database, called the GO_compress database.
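The compression step can be illustrated with a minimal sketch. It assumes that GO numbers are given as accession strings of the form `GO:0005634` and that GO_compress numbers are consecutive integers starting at 1, assigned in ascending order of the original GO number; the function name `build_go_compress` and the ordering rule are illustrative assumptions, not details taken from [33, 56].

```python
def build_go_compress(go_ids):
    """Map sparse GO accession numbers to consecutive 1-based indices.

    Assumption: indices follow the ascending numeric order of the GO numbers.
    """
    # Remove duplicates and sort by the numeric part so the mapping is deterministic.
    ordered = sorted(set(go_ids), key=lambda go_id: int(go_id.split(":")[1]))
    return {go_id: index for index, go_id in enumerate(ordered, start=1)}


# Toy input: three distinct cellular component terms, one repeated.
go_ids = ["GO:0005634", "GO:0005737", "GO:0005576", "GO:0005634"]
go_compress = build_go_compress(go_ids)
print(go_compress)
```

With such a mapping, each GO number points directly at one position of a fixed-length feature vector, regardless of the gaps in the original numbering.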
The number of GO terms grows rapidly over time. Using all GO terms to generate the feature vector is impractical, because the resulting vector would suffer from the curse of dimensionality. In this study, we select the GO terms labeled “cellular component” in the GO database, which comprise 3951 GO numbers, and process them with the method described above.
The protein P is represented as
$$P = [f_1, f_2, \ldots, f_u, \ldots, f_{3951}],$$
where the $f_u$ ($u = 1, 2, \ldots, 3951$) are defined as follows.
BLAST was used to search Swiss-Prot (released on 24 July 2015) for homologous proteins of P, and these homologs were collected into a set. The proteins in the set are regarded as “representative proteins” of P, because they share attributes such as structural conformation and biological function.
If the set is empty, that is, P has no homologous proteins or its homologous proteins have no GO numbers, only P itself is used to search the GO database for the corresponding GO number(s), which are then converted to their GO_compress numbers. As mentioned, an accession number (AC) of a protein in UniProt/Swiss-Prot may correspond to zero, one, or more GO numbers, so the relationship between an AC and GO numbers may be one-to-many. If the set is not empty, P and the homologous proteins in the set are all used to search the GO database, and the GO numbers found are converted to their GO_compress numbers. We find that the prediction results vary with the number of homologous proteins in the set; this is described in detail below.
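The two cases above can be handled by one routine, sketched here under stated assumptions: `annotations` is a hypothetical in-memory dictionary from AC to GO numbers standing in for the GO database lookup, `go_compress` is the GO number to GO_compress number mapping, and `max_homologs` is a hypothetical parameter for limiting how many homologs are used. Dropping individual homologs that carry no selected GO number is also an assumption of this sketch.

```python
def collect_go_compress(query_ac, homolog_acs, annotations, go_compress,
                        max_homologs=None):
    """Return one set of GO_compress numbers per representative protein.

    The first set always belongs to the query protein P itself.
    """
    def compress(ac):
        # One AC may map to zero, one, or many GO numbers.
        return {go_compress[g] for g in annotations.get(ac, ()) if g in go_compress}

    # Optionally limit the number of homologs taken from the BLAST result.
    homolog_sets = [compress(ac) for ac in homolog_acs[:max_homologs]]
    # Assumption: homologs without any selected GO number are dropped.
    homolog_sets = [hits for hits in homolog_sets if hits]
    # If no usable homolog remains, P alone represents itself.
    return [compress(query_ac)] + homolog_sets


# Toy data with hypothetical accession numbers.
go_compress = {"GO:0005634": 1, "GO:0005737": 2}
annotations = {
    "P1": ["GO:0005634"],
    "H1": ["GO:0005634", "GO:0005737"],
    "H2": [],
}
with_homologs = collect_go_compress("P1", ["H1", "H2"], annotations, go_compress)
without_homologs = collect_go_compress("P1", [], annotations, go_compress)
print(with_homologs, without_homologs)
```

Because P is always included, the returned list has at least one element, which matches the fallback to P alone when the homolog set is empty.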
$f_u$ is defined as
$$f_u = \frac{1}{N_P^h} \sum_{j=1}^{N_P^h} \theta(u, j),$$
where $N_P^h$ is the total number of proteins made up of P and its homologs in the set; θ(u, j) = 1 if the jth representative protein hits the uth GO_compress number, and θ(u, j) = 0 otherwise. All proteins in the data set are annotated in the GO database, and their GO numbers can be found in the GOA database. Consequently, the feature vector created by this method is never a zero vector, even when the number of homologous proteins is 0.
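As a worked illustration, the sketch below computes the feature vector under the assumption that each component is the fraction of the representative proteins (P plus its homologs) that hit the corresponding GO_compress number. The function name `go_feature_vector` and the toy input are hypothetical; in this study the vector length would be 3951.

```python
def go_feature_vector(hits_per_protein, n_terms):
    """Compute the GO feature vector from per-protein GO_compress hits.

    hits_per_protein: one set of 1-based GO_compress numbers per
    representative protein, with P itself included.
    n_terms: vector length (number of selected GO_compress numbers).
    """
    n_proteins = len(hits_per_protein)  # counts P plus its homologs
    if n_proteins == 0:
        raise ValueError("P itself must be among the representative proteins")
    counts = [0] * n_terms
    for hits in hits_per_protein:
        for u in hits:
            counts[u - 1] += 1  # theta(u, j) = 1 for this protein
    return [count / n_proteins for count in counts]


# P hits term 1; its single homolog hits terms 1 and 2; term 3 is never hit.
features = go_feature_vector([{1}, {1, 2}], n_terms=3)
print(features)  # [1.0, 0.5, 0.0]
```

Since P is counted among the representative proteins, the divisor is at least 1 even when no homolog is found.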