Validation and testing

Janani Durairaj; Elena Melillo; Harro J. Bouwmeester; Jules Beekwilder; Dick de Ridder; Aalt D. J. van Dijk

This method is extracted from research article: PLoS Comput Biol, Mar 2021

Integrating structure-based machine learning and co-evolution to investigate specificity in plant sesquiterpene synthases

DOI: 10.1371/journal.pcbi.1008197

Three validation schemes are used to test each classifier:

Random Split: A random five-fold cross-validation with 80%-20% train-test split.

Genus Split: A scheme in which cases from 65 genera are used for training and the rest for testing, repeated 10 times with different sets.

Clade Split: All dicot STSs are used for training and monocot and conifer STSs for testing.
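The three schemes can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `seed` parameter and the toy genus and clade labels are all hypothetical, and the source does not state whether the random split is stratified or how the 65 training genera are drawn.

```python
import random

def random_five_fold(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold holds out about 20% of samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test

def genus_split(genera, n_train_genera=65, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every genus falls entirely on one side."""
    rng = random.Random(seed)
    unique_genera = sorted(set(genera))
    for _ in range(n_repeats):
        # Assumption: training genera are drawn uniformly at random each repeat.
        train_genera = set(rng.sample(unique_genera, n_train_genera))
        train = [i for i, g in enumerate(genera) if g in train_genera]
        test = [i for i, g in enumerate(genera) if g not in train_genera]
        yield train, test

def clade_split(clades, train_clade="dicot", test_clades=("monocot", "conifer")):
    """Return (train_idx, test_idx): dicots for training, monocots and conifers for testing."""
    train = [i for i, c in enumerate(clades) if c == train_clade]
    test = [i for i, c in enumerate(clades) if c in test_clades]
    return train, test

# Toy usage with hypothetical labels (not data from the study).
genera = ["Artemisia", "Artemisia", "Zea", "Picea", "Solanum", "Solanum"]
clades = ["dicot", "dicot", "monocot", "conifer", "dicot", "dicot"]

folds = list(random_five_fold(10))
genus_splits = list(genus_split(genera, n_train_genera=2, n_repeats=3))
clade_train, clade_test = clade_split(clades)
```

Holding out whole genera or whole clades, rather than individual sequences, keeps close relatives of a test enzyme out of the training set, which is what distinguishes the second and third schemes from the random split.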

Three metrics are used to measure the performance of each classifier. Nerolidyl cation-specific synthases are treated as the positive class and farnesyl cation-specific synthases as the negative class. TP and TN are the numbers of nerolidyl cation-specific and farnesyl cation-specific synthases, respectively, predicted correctly at a given threshold of predicted probability; FN and FP are the numbers of nerolidyl cation-specific and farnesyl cation-specific synthases, respectively, predicted incorrectly at that threshold. All metrics are calculated using the scikit-learn Python library [47].

Balanced accuracy (bAcc): $\frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$ at a threshold of 0.5.

Area Under the Receiver Operating Characteristic Curve (AUC): Calculated as the area under the plot of the true positive rate (TP as a fraction of all nerolidyl cation-specific synthases) vs. the false positive rate (FP as a fraction of all farnesyl cation-specific synthases) at various threshold settings.

Area Under the Precision-Recall Curve (AUPRC): Calculated as the area under the plot of the precision (TP/(TP + FP)) vs. the recall (TP/(TP + FN)) at various threshold settings.
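The three metrics can be made concrete with a small dependency-free sketch. The source computes them with scikit-learn [47] but does not name the functions used; `balanced_accuracy_score`, `roc_auc_score` and `average_precision_score` would be natural candidates, which is an assumption here. The function names, labels and scores below are hypothetical, with 1 denoting a nerolidyl cation-specific synthase and 0 a farnesyl cation-specific one. The rule that a probability equal to the threshold counts as positive is also an assumption. Note that this sketch integrates both curves with the trapezoidal rule, whereas scikit-learn's `average_precision_score` uses a step-wise sum, so AUPRC values can differ slightly.

```python
def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (TP, TN, FP, FN); a probability >= threshold predicts class 1."""
    tp = tn = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 0 and predicted == 0:
            tn += 1
        elif label == 0 and predicted == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def balanced_accuracy(y_true, y_prob, threshold=0.5):
    """Mean of the recall on the positive class and the recall on the negative class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_prob, threshold)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def _trapezoid(xs, ys):
    """Area under the piecewise-linear curve through the points (xs[i], ys[i])."""
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2
               for i in range(len(xs) - 1))

def roc_auc(y_true, y_prob):
    """Area under the TPR vs. FPR curve, sweeping thresholds from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]
    for t in sorted(set(y_prob), reverse=True):
        tp, tn, fp, fn = confusion_counts(y_true, y_prob, t)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return _trapezoid(fpr, tpr)

def pr_auc(y_true, y_prob):
    """Area under the precision vs. recall curve, starting at (recall 0, precision 1)."""
    recall, precision = [0.0], [1.0]
    for t in sorted(set(y_prob), reverse=True):
        tp, tn, fp, fn = confusion_counts(y_true, y_prob, t)
        recall.append(tp / (tp + fn))
        precision.append(tp / (tp + fp))
    return _trapezoid(recall, precision)

# Toy usage with hypothetical labels and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

counts = confusion_counts(y_true, y_prob)   # (TP, TN, FP, FN) at threshold 0.5
bacc = balanced_accuracy(y_true, y_prob)    # 0.5 * (2/3 + 2/3)
auc = roc_auc(y_true, y_prob)               # 8/9
auprc = pr_auc(y_true, y_prob)              # 65/72
```

Balanced accuracy depends on the single 0.5 threshold, while AUC and AUPRC summarize performance over all thresholds, so the three values need not rank classifiers in the same order.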

The final independent test set consists of 42 newly characterized synthases from the literature (listed in S1 Table).

Copyright and License information: ©2021 Durairaj et al.

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
