Model Selection and Evaluation

To obtain a predictive classification model, we first evaluated several learning algorithms, including random forest, extremely randomized trees, gradient boosting, k-nearest neighbors, and extreme gradient boosting, using the implementations available in the Scikit-learn library.53 After greedy feature selection, random forest achieved the best predictive performance under 10-fold cross-validation on the training set. Therefore, the random forest classifier was used as our final model.
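The comparison described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data are synthetic, the hyperparameters and the choice of ROC AUC as the ranking score are assumptions, and the helper name `compare_classifiers` is hypothetical. Extreme gradient boosting is left out so that the sketch depends only on scikit-learn, and the greedy feature selection step is assumed to have been applied already.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def compare_classifiers(X, y, n_splits=10, random_state=0):
    """Return the mean cross-validated ROC AUC for each candidate classifier."""
    # Hyperparameters below are illustrative assumptions, not the paper's settings.
    candidates = {
        "random_forest": RandomForestClassifier(
            n_estimators=50, random_state=random_state
        ),
        "extra_trees": ExtraTreesClassifier(
            n_estimators=50, random_state=random_state
        ),
        "gradient_boosting": GradientBoostingClassifier(
            n_estimators=50, random_state=random_state
        ),
        "knn": KNeighborsClassifier(n_neighbors=5),
    }
    # The same stratified folds are reused for every model so scores are comparable.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
        for name, model in candidates.items()
    }


# Synthetic stand-in for the (already feature-selected) training set.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)
scores = compare_classifiers(X, y)
best = max(scores, key=scores.get)
print(scores)
print("best by mean 10-fold AUC:", best)
```

Which algorithm ranks first on synthetic data says nothing about the ranking reported in the study; the sketch only shows the selection procedure.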

Random forest is an ensemble-learning algorithm that builds multiple decision trees, each from a randomly chosen subset of the training set. It then aggregates the votes of the individual trees and assigns the test object to the most voted class.54 The predictive model was trained and assessed using 10-fold cross-validation and a non-redundant blind test. Model performance was evaluated using several metrics, including accuracy, precision, and the area under the ROC curve (AUC). AUC summarizes a model’s performance in a classification task across all threshold settings. It is based on the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR). A higher AUC means that the model is better at discriminating between the two classes, active and inactive. AUC takes values between 0 and 1: a perfect classifier has an AUC of 1, whereas an AUC of 0.5 indicates performance no better than random guessing.
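The metrics named above can be computed along the following lines. This is a sketch under stated assumptions: the data are synthetic, a simple stratified hold-out split stands in for the non-redundant blind test, and the helper name `evaluate_classifier` is hypothetical. The last two lines illustrate the two reference points of AUC with toy scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    auc,
    precision_score,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split


def evaluate_classifier(model, X_test, y_test):
    """Return accuracy, precision, and ROC AUC for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    # Probability of the positive ("active") class drives the ROC curve.
    y_score = model.predict_proba(X_test)[:, 1]
    # roc_curve sweeps the decision threshold and returns FPR/TPR pairs.
    fpr, tpr, _ = roc_curve(y_test, y_score)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "auc": auc(fpr, tpr),  # area under the TPR-vs-FPR curve
    }


# Synthetic data; the hold-out split is an assumed stand-in for the blind test.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
metrics = evaluate_classifier(model, X_test, y_test)
print(metrics)

# Toy scores: a perfect ranking of actives over inactives, and uninformative
# constant scores that cannot separate the classes at any threshold.
perfect_auc = roc_auc_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
constant_auc = roc_auc_score([0, 0, 1, 1], [0.5, 0.5, 0.5, 0.5])
```

Accuracy and precision depend on the default 0.5 probability cutoff used by `predict`, whereas AUC is computed from the ranking of the scores alone.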

In the regression counterpart of this work, we also analyzed several supervised regression methods to build 74 models for predicting the GI50% values, including gradient boosting, extreme gradient boosting, random forest, extremely randomized trees, and adaptive boosting, which were applied via the Scikit-learn library.55 After greedy feature selection, Pearson’s correlation coefficient, root-mean-square error (RMSE), and Kendall’s correlation coefficient were used to select the model with the best performance (Table S2). A 10-fold cross-validation procedure and a non-redundant blind test were employed to evaluate the performance of the predictive models. To examine the effect of potential outliers, each model’s performance was evaluated on 100% of the data and on 90% of the data, that is, the full data set minus the 10% worst-predicted data points (i.e., the points that fall farthest from the regression line). Across the datasets, the ensemble methods extremely randomized trees and random forest were found to be the best-performing algorithms (Table S2).
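The regression evaluation can be sketched in the same spirit. The data and hyperparameters below are synthetic and illustrative, and the helper names `regression_metrics` and `drop_worst_predicted` are hypothetical. In particular, the text does not state how the "worst predicted" points were identified; this sketch assumes they are the points with the largest absolute prediction error.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import KFold, cross_val_predict


def regression_metrics(y_true, y_pred):
    """Return Pearson's r, Kendall's tau, and RMSE for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r, _ = pearsonr(y_true, y_pred)
    tau, _ = kendalltau(y_true, y_pred)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    return {"pearson": float(r), "kendall": float(tau), "rmse": rmse}


def drop_worst_predicted(y_true, y_pred, fraction=0.10):
    """Remove the given fraction of points with the largest absolute error.

    Assumption: "worst predicted" is taken to mean largest |y_true - y_pred|.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n_drop = int(fraction * len(y_true))
    order = np.argsort(np.abs(y_true - y_pred))  # smallest error first
    keep = np.sort(order[: len(y_true) - n_drop])  # preserve original order
    return y_true[keep], y_pred[keep]


# Synthetic stand-in for one data set of measured activity values.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
model = ExtraTreesRegressor(n_estimators=100, random_state=0)
# Each sample is predicted by a model that never saw it during training.
y_cv = cross_val_predict(model, X, y, cv=cv)

full = regression_metrics(y, y_cv)
trimmed = regression_metrics(*drop_worst_predicted(y, y_cv, fraction=0.10))
print("100% of data:", full)
print(" 90% of data:", trimmed)
```

Because the points with the largest errors are removed, the RMSE on the 90% subset can never exceed the RMSE on the full set; a large gap between the two suggests that a few outliers dominate the error.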
