Feature Selection

Xiaoqing Ru; Lihong Li; Chunyu Wang

This method is extracted from research article: Front Microbiol, Mar 2019

Identification of Phage Viral Proteins With Hybrid Sequence Features

DOI: 10.3389/fmicb.2019.00507

Using the feature extraction methods described in the section Feature extraction, we extracted a 188-dimensional and a 400-dimensional feature set based on sequence information, and a 473-dimensional feature set based on sequence and secondary structure information, each representing the entire bacteriophage protein sequence dataset. These feature sets still contained redundant or irrelevant features. Such features waste time and computational resources and can reduce the classification accuracy of the model (Chen et al., 2018b,f,g; Dao et al., 2018; Yang et al., 2018; Zhu et al., 2018a,b). We therefore used the Max-Relevance-Max-Distance (MRMD) method (Zou et al., 2016) to select features and identify a higher-quality feature set, i.e., the optimal feature subset. In this method, Pearson's correlation coefficient measures the correlation between each feature and the class labels (MR), which enables the selection of features that correlate strongly with the target class. Three measures (the Euclidean distance, the cosine distance, and the Tanimoto coefficient) quantify the redundancy between features (MD) and identify features with low redundancy.

Taking two feature vectors X and Y as an example, Pearson's correlation coefficient (Pearson, 1909) is expressed, in its standard form, as follows:

$$\rho(X, Y) = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$$

where σ_X and σ_Y denote the standard deviations of the two vectors and cov(X, Y) is their covariance, which measures the linear relationship between two random variables. The covariance is defined as follows:

$$\operatorname{cov}(X, Y) = E\left[(X - \bar{X})(Y - \bar{Y})\right]$$

where $\bar{X}$ and $\bar{Y}$ denote the means of the respective vectors.
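
The relevance step (MR) can be illustrated with a minimal NumPy sketch. It is not the MRMD reference implementation: the function name `pearson_relevance`, the use of the absolute value of the coefficient, the population (1/n) normalization, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def pearson_relevance(features, labels):
    """Absolute Pearson correlation between each feature column and the labels.

    features: array of shape (n_samples, n_features)
    labels:   array of shape (n_samples,), e.g. 0/1 class labels
    Taking the absolute value is an assumption of this sketch.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Covariance of every column with the labels (population form, 1/n).
    cov = (X - X.mean(axis=0)).T @ (y - y.mean()) / len(y)
    # Product of the standard deviations, sigma_X * sigma_Y.
    std = X.std(axis=0) * y.std()
    # A constant column has zero variance; treat its relevance as 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        r = np.where(std > 0, cov / std, 0.0)
    return np.abs(r)

# Toy data (hypothetical): 4 samples, 3 features, binary labels.
toy_features = np.array([[1.0, 5.0, 7.0],
                         [2.0, 3.0, 7.0],
                         [3.0, 8.0, 7.0],
                         [4.0, 1.0, 7.0]])
toy_labels = np.array([0, 0, 1, 1])
print(pearson_relevance(toy_features, toy_labels))
```

Because the same normalization appears in the covariance and in both standard deviations, the choice between 1/n and 1/(n-1) does not change the coefficient.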

The formula for the Euclidean distance (Larson and Edwards, 1991; Deza and Deza, 2009) between X and Y is:

$$ED(X, Y) = \sqrt{\sum_{q=1}^{n} (x_q - y_q)^2}$$

where M is the number of feature vectors, n is the total number of elements in each vector, and x_q and y_q are the q-th elements of X and Y, respectively.

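The cosine distance formula (Tan et al., 2005), in its standard cosine-similarity form, is:

$$\cos(X, Y) = \frac{\sum_{q=1}^{n} x_q y_q}{\sqrt{\sum_{q=1}^{n} x_q^2}\,\sqrt{\sum_{q=1}^{n} y_q^2}}$$

where x_q and y_q are, as before, the q-th elements of X and Y.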

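The Tanimoto coefficient (Rogers and Tanimoto, 1960), in its standard form for real-valued vectors, is given by:

$$TC(X, Y) = \frac{\sum_{q=1}^{n} x_q y_q}{\sum_{q=1}^{n} x_q^2 + \sum_{q=1}^{n} y_q^2 - \sum_{q=1}^{n} x_q y_q}$$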
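
To make the three measures concrete, the following sketch computes them for a pair of vectors with NumPy. The function names and the example vectors are hypothetical, and the formulas are the standard textbook forms, which we assume match those used by MRMD.

```python
import numpy as np

def euclidean_distance(x, y):
    """Square root of the summed squared element-wise differences."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def cosine_similarity(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def tanimoto_coefficient(x, y):
    """Dot product divided by (|x|^2 + |y|^2 - dot product)."""
    dot = x @ y
    return float(dot / (x @ x + y @ y - dot))

# Hypothetical example: y is exactly twice x, so the vectors are collinear.
x = np.array([1.0, 2.0, 2.0])
y = np.array([2.0, 4.0, 4.0])

print(euclidean_distance(x, y))    # 3.0
print(cosine_similarity(x, y))     # 1.0
print(tanimoto_coefficient(x, y))  # 18 / 27, about 0.667
```

The example shows why the measures are complementary: the cosine treats the two collinear vectors as identical, whereas the Euclidean distance and the Tanimoto coefficient still register the difference in magnitude.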

Using these measures, we identified the features with the strongest correlation to the class labels and the lowest redundancy with one another. Depending on the scenario, the weights wr and wd of MR and MD can be adjusted, with features selected by max(wr × MR_i + wd × MD_i), so that the selected features suit the classification task.
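
The weighted criterion max(wr × MR_i + wd × MD_i) can be sketched as a simple ranking routine. This is an illustration under stated assumptions, not the MRMD software: the function `mrmd_style_ranking` is hypothetical, MD_i is taken to be the mean Euclidean distance from feature i to all other features (the cosine and Tanimoto terms are omitted for brevity), and MD is rescaled to [0, 1] so that it is comparable with MR.

```python
import numpy as np

def mrmd_style_ranking(features, labels, wr=1.0, wd=1.0):
    """Rank features by wr * MR_i + wd * MD_i (simplified, Euclidean-only MD).

    Returns (order, score): feature indices sorted from best to worst,
    and the score of every feature.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=float)
    n_features = X.shape[1]

    # MR_i: absolute Pearson correlation between feature i and the labels.
    with np.errstate(divide="ignore", invalid="ignore"):
        mr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(n_features)])
    mr = np.nan_to_num(mr)  # constant features get relevance 0

    # MD_i: mean Euclidean distance from feature i to every other feature.
    # The (M, M, n) intermediate array is fine for a sketch but memory-hungry
    # for large M.
    cols = X.T
    dist = np.sqrt(((cols[:, None, :] - cols[None, :, :]) ** 2).sum(axis=2))
    md = dist.sum(axis=1) / max(n_features - 1, 1)
    if md.max() > 0:
        md = md / md.max()  # rescale to [0, 1] (an assumption of this sketch)

    score = wr * mr + wd * md
    order = np.argsort(-score, kind="stable")
    return order, score

# Hypothetical toy data: feature 1 duplicates feature 0; feature 2 is
# uncorrelated with the labels.
demo_features = np.array([[0.0, 0.0, 1.0],
                          [0.0, 0.0, 0.0],
                          [1.0, 1.0, 1.0],
                          [1.0, 1.0, 0.0]])
demo_labels = np.array([0, 0, 1, 1])
order, score = mrmd_style_ranking(demo_features, demo_labels, wr=1.0, wd=0.5)
print(order, score)
```

Raising wr favors features that track the class labels, while raising wd favors features that differ from the rest; the top-k entries of `order` would then form the candidate subset.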

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
