Overall accuracy, defined as the proportion of correctly classified test examples, is the most widely used metric for evaluating a classifier's performance. When a dataset is imbalanced, however, accuracy is biased toward the overrepresented classes: a classifier can score highly while misclassifying most of the minority class. A quality measure that addresses this issue is the AUROC (area under the receiver operating characteristic curve). We used the AUROC as the main metric to compare the performance of classifiers trained with our datasets. For completeness, we also report precision (the ratio of correctly predicted positive observations to all predicted positive observations), recall (the ratio of correctly predicted positive observations to all observations in the actual positive class), and the F1 score (the harmonic mean of precision and recall) along with the AUROC values.
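As a minimal, illustrative sketch (not part of the original protocol), the metrics above can be computed with scikit-learn, assuming binary labels, predicted probabilities, and a 0.5 decision threshold; the data values below are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

# Hypothetical example data: true labels and predicted positive-class probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])
y_pred = (y_prob >= 0.5).astype(int)  # assumed 0.5 threshold for the label-based metrics

auroc = roc_auc_score(y_true, y_prob)        # AUROC uses the scores, not the thresholded labels
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"AUROC={auroc:.3f} precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```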