2.3. Feature Extraction Methods

Xiao Wang; Hui Li; Qiuwen Zhang; Rong Wang

This method is extracted from research article: Biomed Res Int, Apr 2016

Predicting Subcellular Localization of Apoptosis Proteins Combining GO Features of Homologous Proteins and Distance Weighted KNN Classifier

DOI: 10.1155/2016/1793272

Although GO-based methods have proved to perform very well in predicting subcellular locations, their use has caused some controversy and confusion. If a protein has already been annotated with cellular component GO terms, why does its subcellular location need to be predicted? Is the procedure merely a conversion of the annotation from one format into another? The following facts address these questions. All existing benchmark data sets used by protein subcellular localization predictors were built from proteins in the Swiss-Prot database whose subcellular locations were determined experimentally. Does this mean that the outputs of these predictors are not predictions? No, it does not. For both GO and non-GO predictors, the input is a query protein sequence, with no GO information added, and the output is its subcellular location(s). In other words, as far as the input requirement is concerned, there is no difference between non-GO-approach and GO-approach predictors [53]. The good performance of GO-based methods arises because feature vectors in the GO space reflect subcellular locations better than those in the Euclidean space or any other simple geometric space [54]. Our previous work [33] also strongly supports the legitimacy of using GO information for subcellular localization prediction. Other studies [24, 55] have demonstrated that solving the prediction problem with a lookup table between cellular component GO terms and cellular component categories is not desirable and gives very poor prediction performance.

Following our previous work [33, 56], we first compress and reorganize the GO numbers in the GO database (released on 20 June 2015), because GO numbers are not contiguous. We map each GO number to a GO_compress number and store the processed data in a new database, called the GO_compress database.
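The compression step can be illustrated with a minimal Python sketch. It assumes GO numbers are given as accession strings such as `GO:0005634` and that GO_compress numbers are assigned consecutively from 1 in ascending numeric order. The function name `build_go_compress_map`, the 1-based numbering, and the three example accessions are illustrative choices, not details taken from the article.

```python
def build_go_compress_map(go_ids):
    """Map sparse GO accessions to contiguous GO_compress numbers.

    Assumption: numbering starts at 1 and follows ascending numeric order
    of the GO accession; the source does not specify the ordering.
    """
    unique_sorted = sorted(set(go_ids), key=lambda g: int(g.split(":")[1]))
    return {go_id: idx for idx, go_id in enumerate(unique_sorted, start=1)}


# Illustrative input only: a few accessions, with one duplicate.
example_ids = ["GO:0005737", "GO:0005634", "GO:0005739", "GO:0005634"]
go_compress = build_go_compress_map(example_ids)
```

With such a mapping, each selected GO number corresponds to exactly one position in the feature vector, so the vector has no unused dimensions.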

The number of GO terms is increasing rapidly over time. Using all GO terms to generate the feature vector is impractical, because the resulting vector would suffer from the curse of dimensionality. In this study, only the GO terms marked “cell component” in the GO database are selected, which comprise 3951 GO numbers. These GO numbers are processed with the compression method described above.

The protein P is represented as

$$P = [f_1, f_2, \ldots, f_u, \ldots, f_{3951}]^{T}$$

where each component f_u (u = 1, 2, ..., 3951) is defined as follows.

BLAST was used to search Swiss-Prot (released on 24 July 2015) for the homologous proteins of P, and these homologous proteins were collected into a set. The proteins in the set are regarded as “representative proteins” of P, sharing similar attributes such as structural conformations and biological functions.

If the set is empty, that is, P has no homologous proteins or its homologous proteins have no GO numbers, only P itself is used to search the GO database for the corresponding GO number(s), which are then converted to their GO_compress numbers. As mentioned earlier, an accession number (AC) of a protein in UniProt/Swiss-Prot may correspond to zero, one, or more GO numbers; that is, the relationship between an AC and GO numbers may be one-to-many. If the set is not empty, P and the homologous proteins in the set are used to search the GO database for the corresponding GO number(s), which are then converted to their GO_compress numbers. We find that the prediction results differ with the number of homologous proteins in the set; a detailed description is given later.
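The two cases above can be sketched in Python as follows. The function `collect_annotations` and the dictionaries `ac_to_go` (accession number to GO numbers) and `go_compress` (GO number to GO_compress number) are hypothetical stand-ins for lookups against the GO and GO_compress databases, and the accession strings are placeholders. Keeping unannotated homologs in the returned list when at least one homolog is annotated is an assumption, since the text does not state how they are counted.

```python
def collect_annotations(query_ac, homolog_acs, ac_to_go, go_compress):
    """Return one set of GO_compress numbers per representative protein.

    The first entry always belongs to the query protein P.
    """
    def compress(ac):
        # One AC may map to zero, one, or many GO numbers.
        return {go_compress[g] for g in ac_to_go.get(ac, ()) if g in go_compress}

    homolog_sets = [compress(ac) for ac in homolog_acs]
    if not any(homolog_sets):
        # No homologs, or none of them has GO numbers: use P alone.
        return [compress(query_ac)]
    return [compress(query_ac)] + homolog_sets


# Toy lookup tables (placeholders, not real annotation data).
toy_go_compress = {"GO:0005634": 1, "GO:0005737": 2}
toy_ac_to_go = {
    "QUERY": ["GO:0005634"],
    "HOM_A": ["GO:0005634", "GO:0005737"],
    "HOM_B": [],
}

with_homologs = collect_annotations("QUERY", ["HOM_A", "HOM_B"], toy_ac_to_go, toy_go_compress)
without_homologs = collect_annotations("QUERY", ["HOM_B"], toy_ac_to_go, toy_go_compress)
```

In the second call the only homolog carries no GO numbers, so the sketch falls back to the query protein alone, as in the empty-set case.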

f_u is defined as

$$f_u = \frac{1}{N_P^h} \sum_{j=1}^{N_P^h} \theta(u, j), \quad u = 1, 2, \ldots, 3951$$

where N_P^h is the number of representative proteins, that is, P together with the homologs in the set; θ(u, j) = 1 if the jth representative protein hits the uth GO_compress number, and θ(u, j) = 0 otherwise. All proteins in the data set have been annotated in the GO database, and their GO numbers can be found in the GOA database; therefore, the feature vector created by this method is never a zero vector, even when the number of homologous proteins is 0.
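As a hedged illustration, the sketch below computes the feature vector under the assumption that f_u is the fraction of the N_P^h representative proteins that hit the uth GO_compress number. The function name `go_feature_vector`, the 1-based GO_compress numbering, and the toy input are illustrative assumptions; in the study the vector would have 3951 components.

```python
def go_feature_vector(annotation_sets, n_terms):
    """Compute f_u for u = 1..n_terms.

    annotation_sets: one set of 1-based GO_compress numbers per
    representative protein (P plus its homologs).
    Assumption: f_u is the hit count for term u divided by the number
    of representative proteins.
    """
    if not annotation_sets:
        raise ValueError("at least the query protein P is required")
    n_rep = len(annotation_sets)
    counts = [0] * n_terms
    for terms in annotation_sets:
        for u in terms:
            counts[u - 1] += 1  # theta(u, j) = 1 for this protein
    return [c / n_rep for c in counts]


# Toy example: P hits term 1; one homolog hits terms 1 and 2.
representatives = [{1}, {1, 2}]
features = go_feature_vector(representatives, n_terms=3)
```

Under this assumption every component lies between 0 and 1, and a term shared by all representative proteins receives the value 1.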

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
