Logistic regression feature selection

Zhila Esna Ashari; Nairanjana Dasgupta; Kelly A. Brayton; Shira L. Broschat

Concise Method

This method is extracted from research article: PLoS One, May 2018

An optimal set of features for predicting type IV secretion system effector proteins for a subset of species based on a multi-level feature selection approach

DOI: 10.1371/journal.pone.0197041

After reducing the dimensions of our predictor set and calculating the effective factors, we used these factors to build a binary logistic regression model, to which we applied a fast backward feature selection method. Because we have two classes of responses (effector and non-effector), binary logistic regression is a suitable analysis method. The input to the logistic function can be any real number, and its output takes a value between 0 and 1, representing the probability of being an effector. The form of the logistic function is given by Eq (1).
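For reference, a logistic function with one coefficient per independent variable is commonly written as shown below. The symbols are notation assumed here and may differ from those used in Eq (1) of the source article: p is the probability of being an effector, x_1, ..., x_k are the factors, and \beta_0, ..., \beta_k are the intercept and coefficients.

```latex
p(x_1, \dots, x_k) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}
```

Because the exponent can be any real number, p always lies strictly between 0 and 1.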

For this step, we used Minitab software to build a logistic regression predictor model, to test our calculated factors, and to determine which ones were the most effective based on the resulting model. We used the factors as independent variables and constructed a logistic regression model for each of the four bacteria types. In addition, the Hosmer-Lemeshow test, a goodness-of-fit test, was used to evaluate how well the model's predictions match the observed outcomes and how well the model predicts the effectors. The test works by grouping the input dataset of effectors and non-effectors according to the estimated probability of being an effector. Most software groups the data into deciles, with 10 percent of the data in each group, which is the case for our work. The model is then used to predict whether the proteins in each group are effectors or non-effectors, and the percentage of expected and observed results that are in concordance is calculated.
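The decile grouping described above can be sketched in a few lines of Python. This is a minimal illustration, not the Minitab procedure used in the study: the function names, the 20 made-up probabilities, and the labels are all assumptions. It orders the cases by predicted probability, splits them into 10 near-equal groups, and compares the observed number of effectors in each group with the number the model expects. It also computes the usual Hosmer-Lemeshow chi-square statistic from that table.

```python
def hosmer_lemeshow_groups(probs, labels, n_groups=10):
    """Return one (size, observed, expected) tuple per probability-ordered group."""
    # Order the cases by their predicted probability of being an effector.
    order = sorted(range(len(probs)), key=lambda i: probs[i])
    n = len(order)
    rows = []
    for g in range(n_groups):
        # Each group holds roughly 1/n_groups of the data (deciles by default).
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not members:
            continue
        observed = sum(labels[i] for i in members)   # effectors actually present
        expected = sum(probs[i] for i in members)    # effectors the model expects
        rows.append((len(members), observed, expected))
    return rows


def hosmer_lemeshow_statistic(rows):
    """Sum of (observed - expected)^2 / (n * p_bar * (1 - p_bar)) over the groups."""
    stat = 0.0
    for size, observed, expected in rows:
        mean_p = expected / size
        variance = size * mean_p * (1.0 - mean_p)
        if variance > 0:
            stat += (observed - expected) ** 2 / variance
    return stat


# Made-up example: 20 proteins with evenly spread predicted probabilities.
probs = [(i + 0.5) / 20 for i in range(20)]
labels = [0] * 10 + [1] * 10   # 1 = effector, 0 = non-effector
rows = hosmer_lemeshow_groups(probs, labels)
for size, observed, expected in rows:
    print(f"n={size}  observed={observed}  expected={expected:.2f}")
print("HL statistic:", round(hosmer_lemeshow_statistic(rows), 3))
```

A small statistic means the observed and expected counts agree closely across the groups. Statistical packages typically convert the statistic to a p-value using a chi-square distribution, a step omitted here.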

Considering the logistic function in Eq (1), we see that it associates a coefficient with each independent variable, and the variables with larger coefficients are more effective in the model. First, we built our logistic regression models such that there was no complete separation between effectors and non-effectors based on the factors, a condition that occurs readily for small datasets. In this way, we were able to discern the most informative factors and eliminate the least informative ones. We then built a logistic regression model again and evaluated the effectiveness of the remaining factors. We repeated this process for as long as the concordance rate from the Hosmer-Lemeshow test remained acceptable, that is, greater than 90%. In this way, the set of factors working most effectively to predict effector proteins was selected.
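The elimination loop described above can be sketched as follows. This is an illustration under several stated assumptions, not the Minitab workflow of the study: the model is fitted by plain gradient ascent, the factor removed at each round is the one with the smallest absolute coefficient (which is only meaningful when the factors are on a common scale), and the 90% criterion is approximated by the percentage of concordant (effector, non-effector) pairs. All function names and the synthetic data are hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, steps=500, lr=0.5):
    """Fit intercept and coefficients by gradient ascent on the mean log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)  # beta[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for row, target in zip(X, y):
            err = target - sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], row)))
            grad[0] += err
            for j, v in enumerate(row):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def predict(beta, X):
    return [sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], row))) for row in X]


def concordance(probs, y):
    """Percent of (effector, non-effector) pairs in which the effector scores higher."""
    pos = [p for p, t in zip(probs, y) if t == 1]
    neg = [p for p, t in zip(probs, y) if t == 0]
    wins = sum(1 for a in pos for b in neg if a > b)
    return 100.0 * wins / (len(pos) * len(neg))


def backward_select(X, y, names, threshold=90.0):
    """Drop the weakest factor repeatedly while concordance stays above the threshold."""
    keep = list(range(len(names)))
    while len(keep) > 1:
        beta = fit_logistic([[row[j] for j in keep] for row in X], y)
        weakest = min(range(len(keep)), key=lambda i: abs(beta[i + 1]))
        trial = keep[:weakest] + keep[weakest + 1:]
        trial_X = [[row[j] for j in trial] for row in X]
        if concordance(predict(fit_logistic(trial_X, y), trial_X), y) <= threshold:
            break  # removing this factor would push the fit below the threshold
        keep = trial
    return [names[j] for j in keep]


# Synthetic data: F1 carries the signal, F2 and F3 are uninformative patterns.
# Labels overlap around the middle, so there is no complete separation.
X, y = [], []
for i in range(40):
    X.append([(i - 19.5) / 10,
              1.0 if i % 2 == 0 else -1.0,
              1.0 if i % 4 in (0, 3) else -1.0])
    y.append(1 if i in (16, 17, 20, 21) or i >= 24 else 0)

selected = backward_select(X, y, ["F1", "F2", "F3"])
print("Selected factors:", selected)
```

The loop refits the model after every removal, mirroring the rebuild-and-evaluate cycle in the text. It stops either when a removal would drop concordance to the threshold or when one factor remains.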

In the final step, as discussed in the factor analysis section, we used the factor loadings to determine the set of original features corresponding to the selected factors. If we assume that each original feature is represented by the factor on which it has the greatest loading, we know the set of original features that each factor represents. In this way, we created the group of selected features for each of the four types of pathogens.
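The mapping from selected factors back to original features can be sketched as below. The feature names, factor names, and loading values are made up for illustration. Comparing loadings by absolute value is an assumption, since the text says only "greatest loading".

```python
def features_for_factors(loadings, selected_factors):
    """Assign each feature to its highest-|loading| factor; keep only selected factors."""
    groups = {factor: [] for factor in selected_factors}
    for feature, row in loadings.items():
        # The factor that best represents this feature (assumption: largest |loading|).
        best = max(row, key=lambda factor: abs(row[factor]))
        if best in groups:
            groups[best].append(feature)
    return groups


# Hypothetical loadings: rows are original features, columns are factors.
loadings = {
    "feature_a": {"Factor1": 0.82, "Factor2": 0.10, "Factor3": -0.05},
    "feature_b": {"Factor1": 0.15, "Factor2": 0.71, "Factor3": 0.20},
    "feature_c": {"Factor1": -0.77, "Factor2": 0.05, "Factor3": 0.30},
    "feature_d": {"Factor1": 0.02, "Factor2": 0.25, "Factor3": 0.64},
}

# Suppose the backward selection retained Factor1 and Factor3.
selected_features = features_for_factors(loadings, ["Factor1", "Factor3"])
print(selected_features)
```

In this toy case, feature_b is dropped because the factor that represents it, Factor2, was not retained.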

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
