Random forest training

Martin Vallières, Emily Kay-Rivest, Léo Jean Perrin, Xavier Liem, Christophe Furstoss, Hugo J. W. L. Aerts, Nader Khaouam, Phuc Felix Nguyen-Tan, Chang-Shu Wang, Khalil Sultanem, Jan Seuntjens, Issam El Naqa

Random forest training inherently uses bootstrapping to train the multiple decision trees of the forest. Conventionally, one decision tree is trained per bootstrap sample. In this work, we used 100 bootstrap samples to train each random forest constructed from the training set (H&N1 and H&N2 cohorts; n = 194). For each bootstrap sample, the imbalance-adjustment strategy detailed above was applied, such that each bootstrap sample produced multiple decision trees (one per partition) to be appended to the random forest. The final number of decision trees per random forest therefore depended on the actual proportion of events in each bootstrap sample for each outcome studied. The three final random forest models developed in this work (italic font in Table 1, Supplementary Table S4) were constructed using 582, 661 and 518 decision trees for LR, DM and OS, respectively.
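As an illustration only, the sketch below (Python with scikit-learn, not the authors' original implementation) shows one way this bootstrap plus imbalance-adjusted tree-growing step could be coded. The function names, the rule used to partition the majority class, the assumption of binary 0/1 outcome labels, and the tree hyperparameters are all assumptions for demonstration and not taken from the protocol.

```python
# Minimal sketch of bootstrap-based, imbalance-adjusted forest growing.
# Assumes binary outcome labels coded 0/1; helper names are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_imbalance_adjusted_forest(X, y, n_bootstrap=100, random_state=0):
    """Grow a forest: draw bootstrap samples, then train one tree per
    balanced partition of each bootstrap sample."""
    rng = np.random.default_rng(random_state)
    forest = []
    n = len(y)
    for _ in range(n_bootstrap):
        # Draw a bootstrap sample (sampling with replacement, same size as X).
        idx = rng.integers(0, n, size=n)
        Xb, yb = X[idx], y[idx]

        # Imbalance adjustment (illustrative): split the majority class into
        # partitions roughly the size of the minority class, so that each
        # partition used to train a tree is approximately balanced.
        minority = int(np.bincount(yb).argmin())
        min_idx = np.where(yb == minority)[0]
        maj_idx = rng.permutation(np.where(yb != minority)[0])
        n_parts = max(1, len(maj_idx) // max(1, len(min_idx)))

        for part in np.array_split(maj_idx, n_parts):
            sel = np.concatenate([min_idx, part])
            tree = DecisionTreeClassifier(max_features="sqrt",
                                          random_state=int(rng.integers(1 << 31)))
            tree.fit(Xb[sel], yb[sel])
            forest.append(tree)  # one tree per partition, appended to the forest
    return forest

def forest_predict_proba(forest, X):
    """Average the per-tree probabilities of the positive class (label 1)."""
    return np.mean([t.predict_proba(X)[:, 1] for t in forest], axis=0)
```

With 100 bootstrap samples and a variable number of partitions per sample, the total tree count varies with the event proportion of each bootstrap sample, which is consistent with the outcome-dependent forest sizes (582, 661 and 518 trees) reported above.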

In addition to the imbalance-adjustment strategy adopted in this work, under/oversampling of the instances in each partition of an ensemble was used to further correct for data imbalance during random forest training. Under/oversampling weights for the minority class ranging from 0.5 to 2 in increments of 0.1 were tested. Stratified random sub-sampling, which randomly separates the training set of this work into multiple sub-training and sub-testing sets (n = 10) with a 2:1 size ratio and equal proportions of events, was used to estimate the optimal weight for a given training process (and also the optimal clinical staging variables to be used), defined as the weight maximizing the average AUC over the sub-testing sets. The final random forest models developed in this work (italic font in Table 1, Supplementary Table S4) used oversampling weights of 1.4, 1.6 and 1.7 (in conjunction with the previously described imbalance-adjustment strategy) to train the decision trees of the forests for LR, DM and OS, respectively. The overall random forest training process is pictured in Supplementary Fig. S7.
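The following sketch illustrates how such a weight search could be carried out, reusing the forest-training helpers from the previous sketch. The `resample_minority` helper, the use of scikit-learn's `StratifiedShuffleSplit`, and all settings other than the weight grid (0.5 to 2 in steps of 0.1), the 10 splits and the 2:1 sub-training/sub-testing ratio are illustrative assumptions rather than the authors' code.

```python
# Hedged sketch of selecting the minority-class under/oversampling weight by
# maximal average AUC over stratified random sub-samples of the training set.
# Depends on train_imbalance_adjusted_forest / forest_predict_proba above.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

def resample_minority(X, y, weight, random_state=0):
    """Hypothetical helper: resample the minority class to weight * its size
    (undersampling for weight < 1, oversampling for weight > 1)."""
    minority = int(np.bincount(y).argmin())
    min_idx = np.where(y == minority)[0]
    maj_idx = np.where(y != minority)[0]
    n_target = max(1, int(round(weight * len(min_idx))))
    new_min = resample(min_idx, n_samples=n_target,
                       replace=n_target > len(min_idx),
                       random_state=random_state)
    keep = np.concatenate([maj_idx, new_min])
    return X[keep], y[keep]

def select_sampling_weight(X, y, weights=np.arange(0.5, 2.01, 0.1), n_splits=10):
    """Return the weight maximizing the average AUC over stratified sub-samples."""
    # 2:1 sub-training/sub-testing ratio; stratification keeps event proportions equal.
    splitter = StratifiedShuffleSplit(n_splits=n_splits, test_size=1/3, random_state=0)
    best_w, best_auc = None, -np.inf
    for w in weights:
        aucs = []
        for train_idx, test_idx in splitter.split(X, y):
            Xw, yw = resample_minority(X[train_idx], y[train_idx], w)
            forest = train_imbalance_adjusted_forest(Xw, yw)
            probs = forest_predict_proba(forest, X[test_idx])
            aucs.append(roc_auc_score(y[test_idx], probs))
        if np.mean(aucs) > best_auc:
            best_w, best_auc = w, np.mean(aucs)
    return best_w, best_auc
```

Because the same stratified splits are reused for every candidate weight, the average AUC values are directly comparable across weights, which is the basis for picking a single optimal weight per outcome (e.g., 1.4, 1.6 and 1.7 for LR, DM and OS in this work).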
