Evaluation Metrics and Testing

Moritz Helsper, Aashish Agarwal, Ahmet Aker, Annika Herten, Marvin Darkwah-Oppong, Oliver Gembruch, Cornelius Deuschl, Michael Forsting, Philipp Dammann, Daniela Pierscianek, Ramazan Jabbarli, Ulrich Sure, Karsten Henning Wrede

The overall accuracy, i.e., the proportion of correctly classified test examples, is the most widely used metric for evaluating a classifier's performance. When a dataset is imbalanced, however, accuracy favors the overrepresented classes, so a classifier can achieve a high score while systematically misclassifying the underrepresented ones. A measure of quality that addresses this issue is the AUROC (area under the receiver operating characteristic curve). We used the AUROC as the main metric to compare the performance of classifiers trained with our datasets. For a more complete evaluation, we also report precision (the ratio of correctly predicted positive observations to all predicted positive observations), recall (the ratio of correctly predicted positive observations to all observations in the actual positive class), and the F1 score (the harmonic mean of precision and recall) alongside the AUROC values.
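The following is a minimal sketch of how these metrics can be computed, assuming scikit-learn as the evaluation library; the authors' actual tooling is not specified in the text, and the label and score arrays shown here are hypothetical placeholders.

```python
# Sketch: computing AUROC, precision, recall, and F1 for a binary classifier
# using scikit-learn (an assumption; not stated in the protocol).
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score

# Hypothetical example data:
# y_true  - ground-truth labels, y_score - predicted probabilities for the positive class.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.55, 0.65]

# Hard predictions obtained by thresholding the scores at 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

# AUROC is computed from the continuous scores, so it is threshold-independent
# and less sensitive to class imbalance than overall accuracy.
auroc = roc_auc_score(y_true, y_score)

# Precision, recall, and F1 are computed from the thresholded predictions.
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"AUROC:     {auroc:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```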
