The microarray gene expression values of the NCI-60 cell lines are formatted as a matrix and subdivided into training (75%) and validation (25%) datasets. Each probe of a gene is analyzed as a separate feature for each sample. A support vector machine (SVM) is trained on the training data to obtain a weight for each feature, and the features are sorted by these weights (Table C in S2 File). Models are built using a learning dataset and evaluated using a test dataset. A linear SVM is employed recursively as the classification model to separate samples into two classes: sensitive and resistant. The learning function is svmtrain (Matlab R2013b, version 8.2.0.701) with a linear kernel. Each sample is represented as a vector x, and the two classes are divided in the data space by a hyperplane wx’ + b = 0 that maximizes the margin between the learning samples of the two classes. This margin is defined such that:
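For reference, the textbook hard-margin constraint for a linear SVM can be written as below. This is a sketch under assumptions rather than the exact expression used in the study: the class labels $y_i \in \{-1, +1\}$ (for example resistant and sensitive), the sample index $i$, and the number of learning samples $n$ are notation introduced here. Soft-margin implementations relax the inequality with slack variables.

```latex
y_i \left( w x_i' + b \right) \ge 1, \qquad i = 1, \dots, n,
\qquad \text{with margin width } \frac{2}{\lVert w \rVert}
```

Maximizing the margin is then equivalent to minimizing $\lVert w \rVert$ subject to these constraints.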
Binary classification is performed for the test prediction. The prediction score for each test sample is calculated using the decision function as follows:

f(x) = wx’ + b

where w and b are the weight vector and bias parameter of the SVM model. The input x is the normalized gene expression data of the test sample, restricted to the i features selected by RFE. The patient drug response is classified from this score: a sample is called a responder to the drug if the score is higher than 0, and a non-responder if the score is lower than 0.
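The scoring rule can be illustrated with a minimal Python sketch. The function names `decision_score` and `classify` and all numeric values are hypothetical, chosen only for illustration. The text does not say how a score of exactly 0 is handled, so the sketch reports that case separately as an assumption.

```python
from typing import Sequence


def decision_score(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Linear SVM decision value (dot product of w and x, plus b) for one sample."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same number of features")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(score: float) -> str:
    """Map a decision score to a drug-response label."""
    if score > 0:
        return "responder"
    if score < 0:
        return "non-responder"
    # Assumption: a score of exactly 0 lies on the hyperplane and is left unlabeled.
    return "undetermined"


# Hypothetical weights, bias, and normalized expression values for three features.
w = [0.5, -1.0, 0.25]
b = -0.5
x = [2.0, 0.5, 4.0]

score = decision_score(w, x, b)
print(score, classify(score))
```

In practice w and b would come from the trained SVM model, and x would contain only the RFE-selected features in the same order as w.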
Recursive feature elimination (RFE) is performed to find the minimum set of features that maximizes classification accuracy on the test dataset (Fig 5). The approach starts by removing the 100 least relevant features for the model, taken from the bottom (lowest weights) of the sorted feature list. A new SVM model is then built using the remaining features, and the 100 features with the lowest weights are again removed. This process proceeds recursively until the number of remaining features reaches 100. Thereafter, features are removed one at a time until the most informative set of features is obtained [28–30]. If multiple models share the highest accuracy, the model with the fewest features is selected. Each model is forced to contain a minimum of ten probes. The predictive model for each drug is based on the most informative set of features determined for that drug. Leave-one-out cross-validation (LOOCV) is used to evaluate the performance of each model, as previously described [14].
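The elimination schedule and the model selection rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's Matlab code. The callbacks `fit_weights` and `evaluate` are hypothetical stand-ins for training the linear SVM and measuring accuracy. Features are ranked by absolute weight, although the text says only "lowest weights". When the feature count is not a multiple of 100, the last block step is shortened so that exactly 100 features remain. Elimination is assumed to continue down to the ten-probe minimum before the best model is chosen.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def next_size(n: int, block: int = 100, min_features: int = 10) -> int:
    """Number of features kept after the next elimination step."""
    if n > block:
        # Drop a block of features, but never go below `block` in this phase.
        return max(block, n - block)
    # Below the block threshold, drop one feature at a time.
    return max(min_features, n - 1)


def run_rfe(
    features: Iterable[int],
    fit_weights: Callable[[List[int]], Dict[int, float]],
    evaluate: Callable[[List[int]], float],
    block: int = 100,
    min_features: int = 10,
) -> Tuple[float, List[int]]:
    """Return (accuracy, feature list) of the best model seen during elimination."""
    remaining = list(features)
    candidates = []
    while True:
        candidates.append((evaluate(remaining), list(remaining)))
        if len(remaining) <= min_features:
            break
        weights = fit_weights(remaining)
        keep = next_size(len(remaining), block, min_features)
        # Assumption: relevance is the absolute value of the SVM weight.
        ranked = sorted(remaining, key=lambda f: abs(weights[f]), reverse=True)
        remaining = ranked[:keep]
    # Highest accuracy wins; ties go to the model with the fewest features.
    return max(candidates, key=lambda c: (c[0], -len(c[1])))


# Toy stand-ins (hypothetical): weight grows with feature index,
# and accuracy improves once 15 or fewer features remain.
def toy_fit(feats: List[int]) -> Dict[int, float]:
    return {f: float(f) for f in feats}


def toy_eval(feats: List[int]) -> float:
    return 0.9 if len(feats) <= 15 else 0.7


accuracy, selected = run_rfe(range(250), toy_fit, toy_eval)
print(accuracy, len(selected))
```

With 250 toy features the visited sizes are 250, 150, 100, 99, and so on down to 10. Because every model with 15 or fewer features ties on accuracy in this toy setup, the tie-break picks the ten-feature model.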
This approach takes the microarray expression data of the NCI-60 cancer cell lines as input and outputs a model built on the most informative features.