
# Random forest classifier

This protocol is extracted from the research article "Data-driven phenotype discovery of FMR1 premutation carriers in a population-based sample."

## Procedure

Random forest is a robust, accurate, and reliable classification method with low generalization error and high predictive performance (39). The algorithm repeatedly draws a bootstrap sample (random sampling with replacement) and trains an ensemble of decision trees, one tree per sample. To further ensure diversity among the trees in the forest, only a random subset of variables is considered for use at any node during training. At prediction (testing) time, for a given test instance, the forest aggregates the predictions from all of the trees and returns the most popular class as the final prediction (39).

Although the current study is based on the largest FMR1-informed biobank derived from population data, the number of cases is still relatively small compared with the number of features in our analyses. This imbalance elevates the risk of overfitting the training data, a risk also raised by the use of a nonlinear model such as the decision trees in a random forest. Nevertheless, such nonlinear models provide an opportunity to find important multivariate interactions in the data; they enabled us to find predictive combinations of diagnostic codes that differentiate the two groups. The ensemble nature of random forests in practice reduces the risk of overfitting, and the method can be validly applied to studies in which the number of cases is much smaller than the number of input features (39, 40).

To measure the success of classification, the AUROC was reported. The ROC curve plots the true-positive rate against the false-positive rate; an AUROC of 1.00 indicates perfect classification, and an AUROC of 0.5 represents random classification (26). To ensure that the ROC curve is not overly optimistic, it was constructed by stratified 10-fold cross-validation, the form of held-out testing widely used throughout machine learning for this purpose.
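The forest-building steps above (bootstrap resampling, per-node feature subsampling, and majority voting scored by AUROC) can be sketched with scikit-learn. The synthetic data and the specific parameter values here are illustrative assumptions, not settings from the article.

```python
# Minimal sketch of a random forest classifier evaluated by AUROC.
# Synthetic data stands in for the diagnostic-code features; all
# hyperparameter values below are assumptions for illustration.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,     # one tree per bootstrap sample
    bootstrap=True,       # random sampling with replacement
    max_features="sqrt",  # random variable subset considered at each node
    random_state=0,
)
forest.fit(X_tr, y_tr)

# The forest aggregates the trees' votes into class probabilities;
# an AUROC of 0.5 would indicate chance-level classification.
scores = forest.predict_proba(X_te)[:, 1]
auroc = roc_auc_score(y_te, scores)
print(f"AUROC: {auroc:.2f}")
```

`max_features="sqrt"` is scikit-learn's default for classification and corresponds to the random variable subset considered at each split node.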
In stratified 10-fold cross-validation, cases and controls were each randomly partitioned into 10 parts, and on each fold a different single part of the cases and of the controls was held aside for testing. A predictive model (in our case, random forest) was trained on the remaining 9/10 of the data and tested on the held-out 1/10, and an ROC curve was constructed from the test set. The 10 resulting ROC curves were vertically averaged to avoid any assumption of calibration between folds. Because a single ROC curve was returned by the overall method, no adjustment for multiple comparisons was necessary for the curve or for the P value resulting from the Mann-Whitney U test based on it.
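The cross-validation scheme above can be sketched as follows: train on 9/10 of the data per fold, score the held-out 1/10, then vertically average the fold-level ROC curves by interpolating each curve's true-positive rate onto a common false-positive-rate grid. The interpolation grid, synthetic data, and model settings are assumptions for illustration.

```python
# Sketch: stratified 10-fold CV with vertically averaged ROC curves.
# Vertical averaging = mean TPR at each fixed FPR across the 10 folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, n_features=40, n_informative=6,
                           random_state=0)

mean_fpr = np.linspace(0.0, 1.0, 101)  # common FPR grid (an assumption)
tprs = []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X[train_idx], y[train_idx])            # train on 9/10
    scores = model.predict_proba(X[test_idx])[:, 1]  # test on held-out 1/10
    fpr, tpr, _ = roc_curve(y[test_idx], scores)
    tprs.append(np.interp(mean_fpr, fpr, tpr))       # TPR at the fixed FPRs

# Vertically average the 10 fold-level curves into a single ROC curve.
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[0], mean_tpr[-1] = 0.0, 1.0
mean_auroc = auc(mean_fpr, mean_tpr)
print(f"Vertically averaged AUROC: {mean_auroc:.2f}")
```

Because the procedure yields one averaged curve, a single AUROC summarizes it; the AUROC is also equivalent to the Mann-Whitney U statistic rescaled by the product of the two group sizes, which is why one P value follows from the one curve.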

