Using the Boruta algorithm to assess feature importance

Joshua Traynelis; Michael Silk; Quanli Wang; Samuel F. Berkovic; Liping Liu; David B. Ascher; David J. Balding; Slavé Petrovski

This method is extracted from research article: Genome Res, Oct 2017

Optimizing genomic medicine in epilepsy through a gene-customized approach to missense variant interpretation

DOI: 10.1101/gr.226589.117

In the previous steps, we removed two features because of near-zero variance and a further nine because of high pairwise correlation (Pearson's |r| > 0.75). We then used the Boruta algorithm (R package Boruta) for random forest classifiers to evaluate which of the 20 remaining bioinformatic tools are predictive of pathogenicity in a given gene (Kursa and Rudnicki 2010). Boruta takes an “all-relevant” approach to feature importance: rather than seeking a minimal subset of features, it uses a robust permutation-based procedure to identify every feature that is, in at least some circumstances, relevant to the classification outcome of interest. It judges importance by a feature's ability to outperform randomized copies of the studied true features (referred to as shadow features). Shadow features are obtained for all 20 bioinformatic features by randomly shuffling each original feature's values across the observations, and this shuffling is repeated across runs. Informative features are then defined as those whose random forest Z-score distribution lies above that of the highest-performing randomized feature (i.e., the max shadow feature) (Kursa and Rudnicki 2010). The Z-scores reflect the mean decrease in accuracy measure from R's randomForest function. We set the random seed to 15 and, to increase our confidence, set maxRuns to 1000 random forest runs (an order of magnitude greater than the default setting); we used the R randomForest default settings of ntree = 500 and mtry = 4. Thus, for a given gene, only the features that consistently achieved higher importance scores (Z-scores) than the best-performing (max) shadow feature across all the random forest runs were selected as informative (Fig. 4).
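
The shadow-feature comparison described above can be illustrated with a minimal Python sketch. This is not the Boruta R package and not the authors' code. To stay dependency-free it swaps the random forest Z-score for a much simpler stand-in importance (the standardized gap between class means), so it only demonstrates the "beat the best shuffled copy" logic. The function names, the toy data, and the stand-in score are all assumptions made for illustration; the real algorithm also refits a forest on the true and shadow features together at every run and applies a statistical test to the hit counts.

```python
import random
import statistics


def standardized_mean_gap(column, labels):
    """Stand-in importance score (an assumption, NOT Boruta's measure):
    absolute gap between the two class means, scaled by the column's spread."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    spread = statistics.pstdev(column)
    if spread == 0:
        return 0.0
    return abs(statistics.mean(pos) - statistics.mean(neg)) / spread


def count_hits(features, labels, n_runs=100, seed=15):
    """Count, per feature, the runs in which it beats the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_runs):
        # Build one shadow per true feature by shuffling its values
        # across the observations, which breaks any link to the labels.
        shadow_scores = []
        for column in features.values():
            shadow = list(column)
            rng.shuffle(shadow)
            shadow_scores.append(standardized_mean_gap(shadow, labels))
        max_shadow = max(shadow_scores)
        # A true feature scores a hit only if it outperforms the max shadow.
        for name, column in features.items():
            if standardized_mean_gap(column, labels) > max_shadow:
                hits[name] += 1
    return hits


def make_demo_data(n=100, seed=15):
    """Hypothetical toy data: one feature tied to the label, two pure noise."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    features = {
        "informative": [3.0 * y + rng.gauss(0, 1) for y in labels],
        "noise_a": [rng.gauss(0, 1) for _ in labels],
        "noise_b": [rng.gauss(0, 1) for _ in labels],
    }
    return features, labels


if __name__ == "__main__":
    demo_features, demo_labels = make_demo_data()
    print(count_hits(demo_features, demo_labels))
```

In this toy setting the feature tied to the label should beat the max shadow in nearly every run, whereas the noise features should do so only occasionally. This mirrors the selection rule in the text, where a feature is retained only if it consistently outperforms the best-performing shadow feature.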

We sought to minimize circularity in our feature evaluations by relying only on ExAC v1 and ExAC v2 singleton and rare (MAF < 0.05%) variants, which would not have contributed substantially to the training sets of these features. For example, VEST 3.0, the most consistently top-ranked feature, was trained on missense variants with a minor allele frequency >1% in the Exome Sequencing Project (ESP6500) and on 47,000 HGMD pathogenic missense variants (Carter et al. 2013). Although our study design sought to alleviate concerns about the overall impact of circularity on feature evaluations, this effect remains an important consideration (Grimm et al. 2015).
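
As a rough illustration of the variant filter described above, the following Python sketch keeps a variant if it is a singleton or if its minor allele frequency is below 0.05% (that is, 0.0005 as a proportion). The `Variant` record, its field names, and the example counts are hypothetical and do not reflect the ExAC file format or the authors' pipeline.

```python
from dataclasses import dataclass

# 0.05% expressed as a proportion; percentages and proportions are easy to mix up.
MAF_THRESHOLD = 0.05 / 100


@dataclass
class Variant:
    variant_id: str     # hypothetical identifier
    allele_count: int   # alternate alleles observed in the reference cohort
    allele_number: int  # total alleles genotyped at the site


def is_singleton_or_rare(variant, threshold=MAF_THRESHOLD):
    """Return True for singletons or variants with MAF below the threshold."""
    if variant.allele_number <= 0 or variant.allele_count <= 0:
        return False  # unobserved or uncallable sites are not kept
    if variant.allele_count == 1:
        return True   # singleton: seen exactly once in the cohort
    frequency = variant.allele_count / variant.allele_number
    minor_frequency = min(frequency, 1.0 - frequency)
    return minor_frequency < threshold


# Illustrative counts only (assumed, not taken from ExAC).
examples = [
    Variant("singleton", 1, 120000),
    Variant("rare", 50, 120000),
    Variant("too_common", 100, 120000),
]
kept = [v.variant_id for v in examples if is_singleton_or_rare(v)]
```

The frequency is compared as a proportion on purpose, since writing the threshold as 0.05 would admit variants with a frequency of up to 5%.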

This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
