We trained and evaluated eight classifiers of increasing complexity (Table 3). We used the same leave‐one‐donor‐out test principle to measure the performance of all models. For example, when donor 1 is the test donor, the frequency classifier counts the proportion of positive images in the augmented dataset from donors 2, 3, 5 and 6 and then uses this frequency to predict the activity of all unaugmented images from donor 1. Testing in this way tells us how well each model performs on images from new donors. Donor 4 was not included in this cross‐validation because we randomly selected it as a complete hold‐out donor; all of its images were used only after hyper‐parameter tuning and model selection, as a final independent test of the pipeline's generalizability to a new donor.
Table 3. The eight classifiers with their input features and hyper‐parameters.
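As an illustration of this test principle, the sketch below evaluates the frequency‐classifier baseline under leave‐one‐donor‐out testing. The array names (y_aug, donor_aug, y_raw, donor_raw) are placeholders for the labels and donor identifiers of the augmented and unaugmented images; this is our reconstruction of the idea, not the authors' code.

```python
import numpy as np


def frequency_classifier_lodo(y_aug, donor_aug, y_raw, donor_raw, test_donors):
    """Leave-one-donor-out evaluation of the frequency baseline.

    y_aug / donor_aug: labels and donor IDs of the augmented training images.
    y_raw / donor_raw: labels and donor IDs of the unaugmented images.
    test_donors: donors used in cross-validation (donor 4, the final
    hold-out donor, would be excluded from this list).
    """
    results = {}
    for test_donor in test_donors:
        # "Training": positive proportion among the other donors' augmented images.
        positive_rate = y_aug[donor_aug != test_donor].mean()
        # "Testing": predict that constant frequency for every unaugmented
        # image from the held-out donor.
        y_true = y_raw[donor_raw == test_donor]
        y_score = np.full(y_true.shape, positive_rate)
        results[test_donor] = (y_true, y_score)
    return results
```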
Following the leave‐one‐donor‐out test principle 31, 38, we wanted the selection of the optimal hyper‐parameters to generalize to new donors as well. We therefore applied a nested cross‐validation scheme 55, 56 (Figure 8). For each test donor, we performed 4‐fold cross‐validation in the inner loop to measure the average performance of each hyper‐parameter combination (grid search); each inner‐loop fold corresponds to one donor's augmented images. The outer cross‐validation loop used the hyper‐parameters selected in the inner loop to train a new model on the other four donors' augmented images, and we evaluated this model on the outer‐loop test donor. For models requiring early stopping, we constructed an early stopping set by randomly sampling one‐fourth of the unaugmented images from the training set and removing their augmented copies; training continued as long as performance on the early stopping set improved. Likewise, we did not include augmented images in the validation set or the test set.
Figure 8. The 5 × 4 nested cross‐validation scheme. For each test donor (blue), we used an inner cross‐validation loop to optimize the hyper‐parameters. We trained a model for each hyper‐parameter combination using the training donors' augmented images (yellow) and selected the hyper‐parameters that performed best on the validation donor's images (green). The validation donor is sometimes referred to as a tuning donor in cross‐validation. Then, we trained a final model for each test donor using the selected hyper‐parameters.
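The sketch below illustrates this nested scheme, assuming each donor's augmented and unaugmented images are available as feature‐matrix/label pairs and that make_model is a placeholder factory returning a fresh scikit‐learn‐style classifier. Hyper‐parameters are selected by mean average precision on the inner folds, as discussed in the metrics paragraph below; early stopping is omitted, and this is not the authors' implementation.

```python
from itertools import product

import numpy as np
from sklearn.metrics import average_precision_score


def nested_cv(aug, raw, donors, param_grid, make_model):
    """5 x 4 nested cross-validation by donor.

    aug[d], raw[d]: (X, y) pairs of augmented / unaugmented images for donor d.
    param_grid: dict mapping hyper-parameter names to lists of candidate values.
    make_model(**params): placeholder factory for an untrained classifier.
    """
    outer_scores = {}
    for test_donor in donors:                                # outer loop: 5 test donors
        train_donors = [d for d in donors if d != test_donor]

        # Inner loop: grid search, one fold per training donor.
        best_params, best_ap = None, -np.inf
        for values in product(*param_grid.values()):
            params = dict(zip(param_grid.keys(), values))
            fold_aps = []
            for val_donor in train_donors:                   # 4 inner folds
                X_tr = np.vstack([aug[d][0] for d in train_donors if d != val_donor])
                y_tr = np.hstack([aug[d][1] for d in train_donors if d != val_donor])
                X_val, y_val = raw[val_donor]                # validate on unaugmented images
                model = make_model(**params).fit(X_tr, y_tr)
                scores = model.predict_proba(X_val)[:, 1]
                fold_aps.append(average_precision_score(y_val, scores))
            mean_ap = float(np.mean(fold_aps))
            if mean_ap > best_ap:                            # select by mean average precision
                best_ap, best_params = mean_ap, params

        # Outer loop: retrain on all four training donors with the selected
        # hyper-parameters, then evaluate on the test donor's unaugmented images.
        X_tr = np.vstack([aug[d][0] for d in train_donors])
        y_tr = np.hstack([aug[d][1] for d in train_donors])
        final_model = make_model(**best_params).fit(X_tr, y_tr)
        X_te, y_te = raw[test_donor]
        test_scores = final_model.predict_proba(X_te)[:, 1]
        outer_scores[test_donor] = average_precision_score(y_te, test_scores)
    return outer_scores
```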
No single evaluation metric can capture all the strengths and weaknesses of a classifier, especially because our dataset was class imbalanced and not skewed in the same way for every donor. We therefore considered multiple evaluation metrics in the outer loop. Accuracy measures the percentage of correct predictions; it is easy to interpret, but it does not necessarily characterize a useful classifier. For example, when positive samples are rare, a trivial classifier that predicts every sample as negative still yields high accuracy. Precision and recall (sensitivity), on the other hand, account for the costs of false positive and false negative predictions, respectively. Graphical metrics such as the receiver operating characteristic (ROC) curve and the precision‐recall (PR) curve avoid fixing a specific classification threshold; we summarized ROC curves with the area under the curve (AUC) and PR curves with the average precision. The ROC performance of a random classifier is independent of the class distribution, whereas the PR curve is informative when the classes are imbalanced 39. For this reason, we used the mean average precision across the inner‐loop 4‐fold cross‐validation to select the optimal hyper‐parameters.
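For concreteness, the sketch below computes these outer‐loop metrics for one test donor with scikit‐learn. The y_true and y_score arrays and the 0.5 decision threshold are illustrative assumptions, not values taken from the study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_score, recall_score, roc_auc_score)


def outer_loop_metrics(y_true, y_score, threshold=0.5):
    """Summarize one test donor's predictions with the metrics discussed above.

    y_true: binary activity labels; y_score: predicted positive-class scores.
    The 0.5 threshold for the thresholded metrics is an illustrative default.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred),                          # sensitivity
        "roc_auc": roc_auc_score(y_true, y_score),                       # summarizes the ROC curve
        "average_precision": average_precision_score(y_true, y_score),   # summarizes the PR curve
    }
```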
During the nested cross‐validation, we trained the LeNet CNN and the pre‐trained CNN with fine‐tuning on GPUs (GTX 1080, GTX 1080 Ti, K40, K80, P100 or RTX 2080 Ti); all other models were trained on CPUs.