We trained and evaluated eight classifiers of increasing complexity (Table 3). We used the same leave‐one‐donor‐out test principle to measure the performance of all models. For example, when donor 1 is the test donor, the frequency classifier counts the proportion of positive images in the augmented dataset from donors 2, 3, 5 and 6 and then uses this frequency to predict the activity of all unaugmented images from donor 1. Testing in this way tells us how well each model performs on images from new donors. Donor 4 was not included in this cross‐validation because we randomly selected it as a complete hold‐out donor; all of its images were used only after hyper‐parameter tuning and model selection, as a final independent test of the pipeline's generalizability to a new donor.
Table 3. The eight classifiers with their input features and hyper‐parameters.
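As an illustration of this test principle, the sketch below evaluates the frequency‐classifier baseline under leave‐one‐donor‐out testing. The array names (y_aug, donor_aug, y_raw, donor_raw) are placeholders for the labels and donor identifiers of the augmented and unaugmented images; this is our reconstruction of the idea, not the authors' code.

```python
import numpy as np


def frequency_classifier_lodo(y_aug, donor_aug, y_raw, donor_raw, test_donors):
    """Leave-one-donor-out evaluation of the frequency baseline.

    y_aug / donor_aug: labels and donor IDs of the augmented training images.
    y_raw / donor_raw: labels and donor IDs of the unaugmented images.
    test_donors: donors used in cross-validation (donor 4, the final
    hold-out donor, would be excluded from this list).
    """
    results = {}
    for test_donor in test_donors:
        # "Training": positive proportion among the other donors' augmented images.
        positive_rate = y_aug[donor_aug != test_donor].mean()
        # "Testing": predict that constant frequency for every unaugmented
        # image from the held-out donor.
        y_true = y_raw[donor_raw == test_donor]
        y_score = np.full(y_true.shape, positive_rate)
        results[test_donor] = (y_true, y_score)
    return results
```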
Following the leave‐one‐donor‐out test principle 31, 38, we wanted the selection of the optimal hyper‐parameters to generalize to new donors as well. We therefore applied a nested cross‐validation scheme 55, 56 (Figure 8). For each test donor, we performed 4‐fold cross‐validation in the inner loop to measure the average performance of each hyper‐parameter combination (grid search); each inner‐loop fold corresponds to one donor's augmented images. The outer cross‐validation loop used the hyper‐parameters selected in the inner loop to train a new model on the other four donors' augmented images, and we evaluated this model on the outer‐loop test donor. For models requiring early stopping, we constructed an early stopping set by randomly sampling one‐fourth of the unaugmented images from the training set and removing their augmented copies; training continued as long as performance on the early stopping set improved. Likewise, we did not include augmented images in the validation set or the test set.
Figure 8. The 5 × 4 nested cross‐validation scheme. For each test donor (blue), we used an inner cross‐validation loop to optimize the hyper‐parameters. We trained a model for each hyper‐parameter combination using the training donors' augmented images (yellow) and selected the hyper‐parameters that performed best on the validation donor's images (green). The validation donor is sometimes referred to as a tuning donor in cross‐validation. Then, we trained a final model for each test donor using the selected hyper‐parameters.
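The sketch below illustrates this nested scheme, assuming each donor's augmented and unaugmented images are available as feature‐matrix/label pairs and that make_model is a placeholder factory returning a fresh scikit‐learn‐style classifier. Hyper‐parameters are selected by mean average precision on the inner folds, as discussed in the metrics paragraph below; early stopping is omitted, and this is not the authors' implementation.

```python
from itertools import product

import numpy as np
from sklearn.metrics import average_precision_score


def nested_cv(aug, raw, donors, param_grid, make_model):
    """5 x 4 nested cross-validation by donor.

    aug[d], raw[d]: (X, y) pairs of augmented / unaugmented images for donor d.
    param_grid: dict mapping hyper-parameter names to lists of candidate values.
    make_model(**params): placeholder factory for an untrained classifier.
    """
    outer_scores = {}
    for test_donor in donors:                                # outer loop: 5 test donors
        train_donors = [d for d in donors if d != test_donor]

        # Inner loop: grid search, one fold per training donor.
        best_params, best_ap = None, -np.inf
        for values in product(*param_grid.values()):
            params = dict(zip(param_grid.keys(), values))
            fold_aps = []
            for val_donor in train_donors:                   # 4 inner folds
                X_tr = np.vstack([aug[d][0] for d in train_donors if d != val_donor])
                y_tr = np.hstack([aug[d][1] for d in train_donors if d != val_donor])
                X_val, y_val = raw[val_donor]                # validate on unaugmented images
                model = make_model(**params).fit(X_tr, y_tr)
                scores = model.predict_proba(X_val)[:, 1]
                fold_aps.append(average_precision_score(y_val, scores))
            mean_ap = float(np.mean(fold_aps))
            if mean_ap > best_ap:                            # select by mean average precision
                best_ap, best_params = mean_ap, params

        # Outer loop: retrain on all four training donors with the selected
        # hyper-parameters, then evaluate on the test donor's unaugmented images.
        X_tr = np.vstack([aug[d][0] for d in train_donors])
        y_tr = np.hstack([aug[d][1] for d in train_donors])
        final_model = make_model(**best_params).fit(X_tr, y_tr)
        X_te, y_te = raw[test_donor]
        test_scores = final_model.predict_proba(X_te)[:, 1]
        outer_scores[test_donor] = average_precision_score(y_te, test_scores)
    return outer_scores
```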
No single evaluation metric can capture all the strengths and weaknesses of a classifier, especially because our dataset was class imbalanced and not skewed in the same way for every donor. We therefore considered multiple evaluation metrics in the outer loop. Accuracy measures the percentage of correct predictions; it is easy to interpret, but it does not necessarily characterize a useful classifier. For example, when positive samples are rare, a trivial classifier that predicts every sample as negative still yields high accuracy. Precision and recall (sensitivity), on the other hand, account for the costs of false positive and false negative predictions, respectively. Graphical metrics such as the receiver operating characteristic (ROC) curve and the precision‐recall (PR) curve avoid fixing a specific classification threshold; we summarized ROC curves with the area under the curve (AUC) and PR curves with the average precision. The ROC performance of a random classifier is independent of the class distribution, whereas the PR curve is informative when the classes are imbalanced 39. For this reason, we used the mean average precision across the inner‐loop 4‐fold cross‐validation to select the optimal hyper‐parameters.
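For concreteness, the sketch below computes these outer‐loop metrics for one test donor with scikit‐learn. The y_true and y_score arrays and the 0.5 decision threshold are illustrative assumptions, not values taken from the study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_score, recall_score, roc_auc_score)


def outer_loop_metrics(y_true, y_score, threshold=0.5):
    """Summarize one test donor's predictions with the metrics discussed above.

    y_true: binary activity labels; y_score: predicted positive-class scores.
    The 0.5 threshold for the thresholded metrics is an illustrative default.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred),                          # sensitivity
        "roc_auc": roc_auc_score(y_true, y_score),                       # summarizes the ROC curve
        "average_precision": average_precision_score(y_true, y_score),   # summarizes the PR curve
    }
```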
During the nested cross‐validation, we trained the LeNet CNN and the pre‐trained CNN with fine‐tuning on GPUs (GTX 1080, GTX 1080 Ti, K40, K80, P100 or RTX 2080 Ti); all other models were trained on CPUs.