2.6. Biomarker evaluation protocol

Leonardo Gutiérrez-Gómez
Jakub Vohryzek
Benjamin Chiêm
Philipp S. Baumann
Philippe Conus
Kim Do Cuenod
Patric Hagmann
Jean-Charles Delvenne

Our evaluation methodology is based on that of Abeel et al. (2010), developed for biomarker identification in cancer diagnosis on microarray data. To assess the robustness of the biomarker selection process, we generate slight variations of the dataset and compare the sets of selected features across these variations. For a stable marker selection algorithm, small variations in the training set should not produce substantial changes in the retrieved set of features.

We perform a nested 5-fold cross-validation (CV) approach. The outer CV provides an unbiased estimate of the performance of the method, whereas the inner CV loop is used to fit, tune, and select the optimal parameters of the model. Concretely, we generate 100 subsamplings of the original dataset by shuffling the outer 5-fold CV scheme 20 times. Eighty percent of the data, i.e., four folds of the outer CV (pink in Fig. 1), is used as the training set within the inner CV, where the best model and features are selected. That is, four inner folds are used as the training set and the held-out fold as the validation set to tune the parameters of the model. The model achieving the best performance on the validation set is selected, together with the features retained by the RFE-SVM method. The remaining 20% of the outer CV, i.e., the held-out fold, is used as the testing set to provide an unbiased evaluation of the final model and assess the performance of the classifier. The overall accuracy is therefore the average testing accuracy across subsamplings. See Fig. 1 for a schematic view of the methodology.
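The outer subsampling scheme (20 shuffles of a 5-fold split, yielding 100 train/test partitions) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name and parameters are ours, not the authors', and the inner CV and RFE-SVM steps are omitted.

```python
import random

def outer_cv_splits(n_samples, n_folds=5, n_shuffles=20, seed=0):
    """Yield the n_shuffles * n_folds outer train/test subsamplings.

    Each shuffle re-partitions the sample indices into n_folds folds;
    each fold in turn serves as the held-out ~20% testing set while the
    remaining ~80% forms the training set for the inner CV.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(n_shuffles):
        rng.shuffle(indices)
        fold_size = n_samples // n_folds
        folds = [indices[i * fold_size:(i + 1) * fold_size]
                 for i in range(n_folds)]
        # distribute any remainder samples over the first folds
        for j, extra in enumerate(indices[n_folds * fold_size:]):
            folds[j].append(extra)
        for k in range(n_folds):
            test = folds[k]
            train = [i for j, fold in enumerate(folds) if j != k
                     for i in fold]
            yield train, test
```

With the defaults this generator produces exactly 100 disjoint train/test partitions, one per subsampling, matching the 20 x 5 scheme described above.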

Fig. 1. Overview of the proposed method. The figure represents the nested 5-fold CV subsampling of the entire dataset (top-left gray bar). (Left) The outer CV is used to evaluate the performance of the model. Eighty percent of the data, i.e., four folds (pink box), is used as the training set, where the best model and features are selected. The remaining 20% is used as the testing set to evaluate the performance of the model. (Right) Within the inner CV, four folds are used for training and the held-out fold as the validation set. The best model, features, and parameters are selected according to the best CV accuracy. The outer CV is shuffled 20 times, generating 100 subsamplings of the dataset and therefore the same number of selected-feature 'fingerprints'. The stability of the selected biomarkers and the final accuracy are assessed over all subsampling estimates.
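The stability of the 100 feature 'fingerprints' can be summarized with a set-overlap score. The excerpt does not name the exact stability index used, so treat the choice below, the average pairwise Jaccard similarity between selected-feature sets, as an illustrative assumption:

```python
from itertools import combinations

def average_jaccard_stability(fingerprints):
    """Average pairwise Jaccard similarity over all selected-feature sets.

    Each fingerprint is an iterable of selected feature identifiers.
    Values near 1 indicate a stable selection across subsamplings;
    values near 0 indicate an unstable one.
    """
    sets = [set(f) for f in fingerprints]
    pairs = list(combinations(sets, 2))
    total = sum(len(a & b) / len(a | b) for a, b in pairs)
    return total / len(pairs)
```

For example, 100 identical fingerprints give a score of 1.0, while mutually disjoint fingerprints give 0.0.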
