Using the feature extraction methods described in the Feature extraction section, we extracted 188-dimensional and 400-dimensional feature sets based on sequence information, and a 473-dimensional feature set based on sequence and secondary structure information, each representing the entire bacteriophage protein sequence dataset. These feature sets still contained some redundant or irrelevant features. Such features waste time and computational resources and can reduce the classification accuracy of the model (Chen et al., 2018b,f,g; Dao et al., 2018; Yang et al., 2018; Zhu et al., 2018a,b). In this paper, the Max-Relevance-Max-Distance (MRMD) method (Zou et al., 2016) was used to select features and identify a higher-quality feature set, i.e., the optimal feature subset. In this method, Pearson's correlation coefficient is used to measure the correlation between each feature and the class labels (MR), enabling the selection of features that are strongly correlated with the target class. Three distance functions (the Euclidean distance, the cosine distance, and the Tanimoto coefficient) are used to measure the redundancy between features (MD) and identify features with low redundancy.
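Taking two feature vectors (X, Y) as an example, Pearson's correlation coefficient (Pearson, 1909) is expressed as follows:
$$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
where $\sigma_X$ and $\sigma_Y$ denote the standard deviations of the two vectors and $\mathrm{cov}(X, Y)$ is the covariance, which measures the relationship between two random variables. The covariance formula is as follows:
$$\mathrm{cov}(X, Y) = E\left[(X - \bar{X})(Y - \bar{Y})\right]$$
where $\bar{X}$ and $\bar{Y}$ denote the means of the respective vectors and $E[\cdot]$ denotes the expected value.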
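To make the relevance (MR) term concrete, the following is a minimal Python sketch of Pearson's correlation coefficient between one feature column and the class labels. The function name `pearson`, the toy data, the handling of constant features, and the use of the absolute value as the relevance score are illustrative assumptions, not details taken from the MRMD implementation. The sketch uses the population (divide-by-n) normalization; the choice cancels in the ratio, so it does not affect the coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance and standard deviations (divide by n).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if sigma_x == 0 or sigma_y == 0:
        # Assumption: a constant vector is treated as uncorrelated.
        return 0.0
    return cov / (sigma_x * sigma_y)


# Hypothetical toy data: one feature column and binary class labels.
feature = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]

# Assumption: relevance is taken as the magnitude of the correlation.
mr = abs(pearson(feature, labels))
```

A feature whose values track the labels closely yields a relevance near 1, whereas a feature unrelated to the labels yields a value near 0.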
The formula for the Euclidean distance (Larson and Edwards, 1991; Deza and Deza, 2009) is:
$$ED(X, Y) = \sqrt{\sum_{q=1}^{n} (x_q - y_q)^2}$$
where M is the number of feature vectors, n is the total number of elements in each vector, and $x_q$ and $y_q$ are the q-th elements of X and Y, respectively.
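The cosine distance formula (Tan et al., 2005) is:
$$\cos(X, Y) = \frac{X \cdot Y}{\|X\| \, \|Y\|}$$
where $X \cdot Y = \sum_{q=1}^{n} x_q y_q$ is the dot product and $\|X\| = \sqrt{\sum_{q=1}^{n} x_q^2}$ is the Euclidean norm.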
The Tanimoto coefficient (Rogers and Tanimoto, 1960) is given by:
$$TC(X, Y) = \frac{X \cdot Y}{\|X\|^2 + \|Y\|^2 - X \cdot Y}$$
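The three measures used for the redundancy (MD) term can be sketched in a few lines of Python. This is an illustrative sketch using the standard textbook definitions; the function names and the zero-vector conventions are assumptions, and the cosine measure is written in its common similarity form (the cosine of the angle between the two vectors), which may differ from the exact form used by the MRMD software.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    denom = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    # Assumption: return 0.0 when either vector is all zeros.
    return dot(x, y) / denom if denom else 0.0


def tanimoto(x, y):
    """Tanimoto coefficient: x.y / (|x|^2 + |y|^2 - x.y)."""
    xy = dot(x, y)
    denom = dot(x, x) + dot(y, y) - xy
    # Assumption: return 0.0 when both vectors are all zeros.
    return xy / denom if denom else 0.0
```

Note that the Euclidean distance grows as two features become more different, whereas the cosine and Tanimoto values grow as they become more similar, so a redundancy score built from them has to account for the direction of each measure.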
Using these measures, we identified the features with the strongest correlation to the class labels and the lowest redundancy with respect to one another. The weights of MR and MD can be adjusted for different scenarios, with features selected according to $\max(w_r \times MR_i + w_d \times MD_i)$, to ensure that the acquired features are suitable for the classification task.
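The weighted selection criterion can be illustrated with a short Python sketch that ranks features by the combined score w_r × MR_i + w_d × MD_i. The function name `rank_features`, the default weights, and the toy MR and MD values are hypothetical and chosen only for illustration; they are not values from this study.

```python
def rank_features(mr, md, w_r=1.0, w_d=1.0):
    """Rank feature indices by w_r * MR_i + w_d * MD_i, highest first."""
    scores = [w_r * r + w_d * d for r, d in zip(mr, md)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical relevance (MR) and distance (MD) values for three features.
mr = [0.9, 0.2, 0.6]
md = [0.1, 0.5, 0.7]

# Equal weights: relevance and low redundancy count the same.
order, scores = rank_features(mr, md)

# Relevance only: setting w_d to 0 ignores the redundancy term.
order_mr_only, _ = rank_features(mr, md, w_r=1.0, w_d=0.0)
```

Raising w_r favors features that correlate strongly with the class labels, while raising w_d favors features that are far from the others; the top-ranked indices then form the candidate feature subset.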