The goodness of fit (reliability) plot depicting agreement between the observed proportion of viral suppression and predicted probability of virological suppression were generated for each model [41]. In this plot the range of predicted probabilities was discretized into 20 intervals. The mean predicted probability and the associated observed proportion of viral suppression in each interval were calculated and plotted. The points should be near to the diagonal if the model is well calibrated, otherwise the model would be misspecified [42–44]. A corresponding sharpness diagram was plotted to show the distribution of the different probability categories used to generate the reliability plot. The root mean squared error (RMSE) with respect to the identity line was also calculated.
The calculated probabilities were used to assess the overall predictive performance of the models by calculating the means squared error (MSE), also known as the brier score [45]. Since the outcome prevalence in the test datasets used for the MTLR and PSSP was 0.47, a brier score of less than 0.245 was considered as satisfactory predictive performance. For the SLR model, the outcome prevalence in the dataset was 0.115 thus a brier score less than 0.102 was considered satisfactory [41].
The model’s discriminative ability was assessed by generating a receiver operator characteristics curve and the corresponding c-statistic (AUROC), and the precision-recall curve and the corresponding area under the precision recall curve (AUPRC) using the non-parametric method [46]. A c-statistic is a measure of the ability of the model to correctly classify those with and without the outcome. C-statistic values of 0.5–0.7,0.7-0.79,0.8–0.89 and > 0.9 were considered, poor, moderate, good and excellent predictions respectively [47]. An AUPRC value above 0.47 was considered satisfactory. The F1 score, which is the harmonic mean of precision and recall, was also calculated. The closer the F1 score to 1 the higher the discriminative ability of the model while values close to 0 meant poor discrimination [48].
The Youden indices (J-statistic) of each model was obtained by searching among plausible values of the predicted probability of outcome for which the sum of sensitivity and specificity was a maximum [49]. For any task, if the patient’s predicted probability was above the obtained J-statistic, viral suppression was predicted to occur therefore the J-statistic was considered the decision boundary (cut-point) between low and high probability patients [50, 51].
Using the cut-point, the performance of the models outside the studied population, setting and period, also known as temporo-spatial transportability was assessed on the EFV cohort dataset, to ensure practical applicability of the model [52]. The shared tasks between the IDI and EFV datasets were day 1, 84, 112, 140 and 168 and model transportability was tested only on these tasks.
The models were used to predict probability of suppression in the EFV dataset. The prediction accuracy, sensitivity, selectivity, positive negative predictive value and positive predictive value were generated for each model.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.