Statistical validation of the QSAR models

Dejun Jiang, Tailong Lei, Zhe Wang, Chao Shen, Dongsheng Cao, Tingjun Hou

According to the Organization for Economic Co-operation and Development (OECD) Validation Principles [4, 100], a QSAR model should be associated with appropriate measures of goodness-of-fit, robustness, and predictivity. Here, random fivefold cross-validation on the training set was used to evaluate the robustness of each model, and external validation, given by the predictions on the test set, was used to measure its actual predictivity. In addition to random fivefold cross-validation, the performance of the seven studied ML algorithms was also evaluated by the cluster cross-validation method proposed by Mayr et al. [101, 102]. Agglomerative hierarchical clustering with complete linkage was used to identify compound clusters, where the distance between any two compounds was measured by the Tanimoto similarity based on the PubChem fingerprints. The maximum distance between compounds within an identified cluster was set to 0.7, and the resulting smaller compound clusters were distributed randomly across the five folds. The following statistical parameters based on the confusion matrix [true positives (TP), true negatives (TN), false positives (FP), false negatives (FN)] were used to assess the performance of each model: global accuracy (GA), balanced accuracy (BA), Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic (ROC) curve (AUC). GA, BA, and MCC are defined as follows:

$$\mathrm{GA} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{BA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
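These confusion-matrix statistics can be sketched in a few lines of plain Python. The counts used below are hypothetical illustration values, not results from this study:

```python
import math

def confusion_stats(tp, tn, fp, fn):
    """Compute GA, BA and MCC from confusion-matrix counts."""
    ga = (tp + tn) / (tp + tn + fp + fn)              # global accuracy
    ba = 0.5 * (tp / (tp + fn) + tn / (tn + fp))      # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation coefficient
    return ga, ba, mcc

# Hypothetical counts for illustration only
ga, ba, mcc = confusion_stats(tp=80, tn=60, fp=20, fn=10)
```

Note that BA is preferable to GA when the inhibitor and non-inhibitor classes are imbalanced, since it averages the per-class recalls rather than pooling all predictions.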

Moreover, residual distribution plots (with the binary cross entropy used as the residual function in this study) for the fivefold cross-validated training set and the test set were used to further diagnose the quality of the classification models. Unlike the confusion-matrix-based statistical parameters, which only reflect the predicted class of a compound, the residual provides information about the predicted class probability distribution of a compound. Consider, for example, an inhibitor I [reference class probability distribution (0, 1)] and a non-inhibitor NI [reference class probability distribution (1, 0)]. Suppose the predicted class probabilities for I and NI given by model A are (0.4, 0.6) and (0.65, 0.35), respectively, and those given by model B are (0.1, 0.9) and (0.95, 0.05), respectively. Both models yield the same confusion matrix, and their statistical parameters would therefore be identical [for the majority of ML approaches, the class of a compound is derived from its class probability distribution: if the probability of a class is larger than or equal to a certain threshold (generally 0.5), the compound is predicted as that class]. Which model should be used in this case? We would still regard model B as better than model A, because the residual of model B with respect to the reference class probability distribution is smaller than that of model A. In this study, the binary cross entropy was used as the residual function, defined as:

$$\mathrm{BCE} = -\left[y \log(p) + (1 - y) \log(1 - p)\right]$$

where y is the true class label of a compound (1 for an inhibitor and 0 for a non-inhibitor) and p is the predicted probability that the compound is an inhibitor. The greater the binary cross entropy, the larger the residual, and vice versa.
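Using the model A/model B example above, a short sketch in plain Python confirms that although both models classify I and NI correctly at the 0.5 threshold, model B's binary cross entropy residuals are smaller:

```python
import math

def bce(y, p, eps=1e-12):
    """Binary cross entropy for true label y (0 or 1) and predicted inhibitor probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip p to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Model A: inhibitor I predicted with p = 0.6, non-inhibitor NI with p = 0.35
res_a = bce(1, 0.6) + bce(0, 0.35)
# Model B: inhibitor I predicted with p = 0.9, non-inhibitor NI with p = 0.05
res_b = bce(1, 0.9) + bce(0, 0.05)
```

Both models place the correct class above the 0.5 threshold, so the confusion matrices agree, yet `res_b` is substantially smaller than `res_a`, which is exactly the distinction the residual distribution plots are meant to expose.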
