Machine-Learning. SVM: Recursive feature elimination (RFE)

Cai Huang; Roman Mezencev; John F. McDonald; Fredrik Vannberg

This method is extracted from research article: PLoS One, Oct 2017

Open source machine-learning algorithms for the prediction of optimal cancer drug therapies

DOI: 10.1371/journal.pone.0186906

The microarray gene expression values of the NCI-60 cell lines are formatted as a matrix and subdivided into training (75%) and validation (25%) datasets. Each probe of a gene is analyzed as a separate feature for each sample. An SVM is applied to the training data to obtain a weight for each feature, and the features are sorted by these weights (Table C in S2 File). Models are built using a learning dataset and evaluated using a test dataset. A linear support vector machine (SVM) is employed recursively as a classification model to separate samples into two classes: sensitive and resistant. The learning function is svmtrain (Matlab R2013b, version 8.2.0.701), and the kernel function is linear. Each sample is represented as a vector x, and the two classes are divided in the data space by a hyperplane wx' + b = 0 that maximizes the margin between the learning samples of the two classes. This margin is defined such that:

wx' + b ≥ 1 for samples of one class, and wx' + b ≤ −1 for samples of the other class.
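The split and ranking steps described above can be sketched in a few lines. This is a minimal Python illustration, not the authors' Matlab code: the function names `split_samples` and `rank_features`, the fixed random seed, and the choice to rank by absolute weight are assumptions made for the example, and the weights are supplied by hand rather than fitted by an SVM.

```python
import random

def split_samples(sample_ids, train_fraction=0.75, seed=0):
    """Shuffle sample IDs and split them into training and validation lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

def rank_features(weights):
    """Return feature indices sorted from most to least relevant.

    Assumption: relevance is the absolute value of the linear SVM weight.
    """
    return sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)

# 60 hypothetical cell-line IDs standing in for the NCI-60 panel
train_ids, valid_ids = split_samples([f"cell_{k}" for k in range(60)])
print(len(train_ids), len(valid_ids))

# Hypothetical weights for five probes
ranking = rank_features([0.2, -1.5, 0.05, 0.9, -0.3])
print(ranking)
```

In practice the weights would come from the fitted linear model, with one weight per probe.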

Binary classification is performed for the test prediction. The prediction score for each test sample is calculated using the decision function:

f(x) = wx' + b

where w and b are the weight vector and bias parameter from the SVM model. The input x is the normalized gene expression vector of a test sample, restricted to the i features selected by RFE. The classification of the patient drug response is based on this score: a sample is called a responder to the drug if the score is higher than 0, and a non-responder if the score is lower than 0.
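The scoring rule above amounts to a dot product plus a bias, followed by a sign test. The sketch below is illustrative only: `decision_score` and `classify` are hypothetical names, the numbers are made up, and the handling of a score of exactly 0 is an assumption, since the text does not define that case.

```python
def decision_score(w, x, b):
    """Linear SVM decision value: dot product of w and x, plus the bias b."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same number of features")
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def classify(score):
    """Map a decision value to a drug-response call."""
    if score > 0:
        return "responder"
    if score < 0:
        return "non-responder"
    # Assumption: the source does not say what happens at exactly 0
    return "undetermined"

# Made-up weights, bias, and normalized expression values for three probes
w = [0.5, -1.0, 2.0]
b = -0.25
x = [1.0, 0.5, 0.25]

score = decision_score(w, x, b)
label = classify(score)
print(score, label)
```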

Recursive feature elimination (RFE) was performed to find the minimum set of features that maximized classification accuracy on the test dataset (Fig 5). The approach starts by removing the 100 least relevant features from the bottom (lowest weights) of the sorted feature list. The next SVM model is built using the remaining features, and the 100 features with the lowest weights are again removed. This process proceeds recursively until the number of remaining features reaches 100. Thereafter, features are removed one at a time until the most informative set of features is obtained [28–30]. If multiple models share the highest accuracy, the model with the fewest features is selected. Each model is forced to contain a minimum of ten probes. The predictive model for each drug is based on the most informative set of features determined for that drug. Leave-one-out cross-validation (LOOCV) is used to evaluate the performance of each model, as previously described [14].
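The elimination schedule described above can be sketched as a short loop. This is a hedged illustration rather than the authors' implementation: `rfe`, `fit_weights`, and `evaluate` are hypothetical names, the two callables stand in for the SVM fit and the accuracy estimate, ranking by absolute weight is an assumption, and so is the clamp to exactly 100 features when the starting count is not a multiple of 100.

```python
def rfe(features, fit_weights, evaluate, block=100, min_features=10):
    """Recursive feature elimination with a block-then-single schedule.

    fit_weights(features) -> dict mapping each feature to its weight
        (hypothetical callable, e.g. a wrapper around a linear SVM fit)
    evaluate(features) -> accuracy of a model built on those features
        (hypothetical callable)
    """
    current = list(features)
    best_acc, best_set = -1.0, None
    while True:
        acc = evaluate(current)
        # ">=" means a tie is resolved in favor of the later, smaller subset
        if acc >= best_acc:
            best_acc, best_set = acc, list(current)
        if len(current) <= min_features:
            break
        weights = fit_weights(current)
        # Assumption: "lowest weight" is taken as smallest absolute weight
        ranked = sorted(current, key=lambda f: abs(weights[f]), reverse=True)
        if len(current) > block:
            # Remove a block at a time, but never go below `block` features
            n_keep = max(len(current) - block, block)
        else:
            # Remove one feature at a time down to the minimum model size
            n_keep = max(len(current) - 1, min_features)
        current = ranked[:n_keep]
    return best_set, best_acc

# Toy problem: 350 features, of which the 12 highest-numbered are informative
informative = set(range(338, 350))
sizes = []  # records the size of every candidate subset that is evaluated

def toy_weights(feats):
    return {f: f + 1 for f in feats}  # higher index gets a larger weight

def toy_accuracy(feats):
    sizes.append(len(feats))
    hits = len(informative & set(feats))
    noise = len(feats) - hits
    return hits / 12 - 0.001 * noise  # uninformative features cost accuracy

best_set, best_acc = rfe(range(350), toy_weights, toy_accuracy)
print(sorted(best_set), best_acc)
```

In a real run, `evaluate` would wrap whatever accuracy estimate is used for model selection, and `fit_weights` would refit the linear SVM on the surviving features at every round.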

The input to this approach is the microarray expression data of the NCI-60 cancer cell lines, and the output is a model built on the most informative features.

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
