Statistical validation of the QSAR models

Dejun Jiang, Tailong Lei, Zhe Wang, Chao Shen, Dongsheng Cao, Tingjun Hou

According to the Organization for Economic Co-operation and Development (OECD) Validation Principles [4, 100], a QSAR model should be associated with appropriate measures of goodness-of-fit, robustness, and predictivity. Here, random fivefold cross-validation on the training set was used to evaluate the robustness of each model, and external validation, given by the predictions on the test set, was used to measure its actual predictivity. In addition to random fivefold cross-validation, the performance of the seven studied ML algorithms was also evaluated by the cluster cross-validation method proposed by Mayr et al. [101, 102]. Agglomerative hierarchical clustering with complete linkage was used to identify compound clusters, where the distance between any two compounds was measured by the Tanimoto similarity based on the PubChem fingerprints. The maximum distance between compounds within an identified cluster was set to 0.7, and the resulting smaller compound clusters were distributed randomly across the five folds. The following statistical parameters based on the confusion matrix [true positives (TP), true negatives (TN), false positives (FP), false negatives (FN)] were used to assess the performance of each model: global accuracy (GA), balanced accuracy (BA), Matthews correlation coefficient (MCC), and the area under the receiver operating characteristic (ROC) curve (AUC). GA, BA, and MCC are defined as follows:

$$\mathrm{GA} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{BA} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
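These confusion-matrix statistics can be sketched in a few lines of plain Python. The counts used below are hypothetical illustration values, not results from this study:

```python
import math

def confusion_stats(tp, tn, fp, fn):
    """Compute GA, BA and MCC from confusion-matrix counts."""
    ga = (tp + tn) / (tp + tn + fp + fn)              # global accuracy
    ba = 0.5 * (tp / (tp + fn) + tn / (tn + fp))      # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # Matthews correlation coefficient
    return ga, ba, mcc

# Hypothetical counts for illustration only
ga, ba, mcc = confusion_stats(tp=80, tn=60, fp=20, fn=10)
```

Note that BA is preferable to GA when the inhibitor and non-inhibitor classes are imbalanced, since it averages the per-class recalls rather than pooling all predictions.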

Moreover, residual distribution plots (with the binary cross entropy used as the residual function in this study) for the fivefold cross-validated training set and the test set were used to further diagnose the quality of the classification models. Unlike the confusion-matrix-based statistical parameters, which only reflect the predicted class of a compound, the residual provides information about the predicted class probability distribution of a compound. Consider, for example, an inhibitor I [reference class probability distribution (0, 1)] and a non-inhibitor NI [reference class probability distribution (1, 0)]. Suppose the predicted class probabilities for I and NI given by model A are (0.4, 0.6) and (0.65, 0.35), respectively, and those given by model B are (0.1, 0.9) and (0.95, 0.05), respectively. Both models yield the same confusion matrix, and their statistical parameters would therefore be identical [for the majority of ML approaches, the class of a compound is derived from its class probability distribution: if the probability of a class is larger than or equal to a certain threshold (generally 0.5), the compound is predicted as that class]. Which model should be used in this case? We would still regard model B as better than model A, because the residual of model B with respect to the reference class probability distribution is smaller than that of model A. In this study, the binary cross entropy was used as the residual function, defined as:

$$\mathrm{BCE} = -\left[y \log(p) + (1 - y) \log(1 - p)\right]$$

where y is the true class label of a compound (1 for an inhibitor and 0 for a non-inhibitor) and p is the predicted probability that the compound is an inhibitor. The greater the binary cross entropy, the larger the residual, and vice versa.
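Using the model A/model B example above, a short sketch in plain Python confirms that although both models classify I and NI correctly at the 0.5 threshold, model B's binary cross entropy residuals are smaller:

```python
import math

def bce(y, p, eps=1e-12):
    """Binary cross entropy for true label y (0 or 1) and predicted inhibitor probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clip p to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Model A: inhibitor I predicted with p = 0.6, non-inhibitor NI with p = 0.35
res_a = bce(1, 0.6) + bce(0, 0.35)
# Model B: inhibitor I predicted with p = 0.9, non-inhibitor NI with p = 0.05
res_b = bce(1, 0.9) + bce(0, 0.05)
```

Both models place the correct class above the 0.5 threshold, so the confusion matrices agree, yet `res_b` is substantially smaller than `res_a`, which is exactly the distinction the residual distribution plots are meant to expose.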
