Logistic regression feature selection

Zhila Esna Ashari
Nairanjana Dasgupta
Kelly A. Brayton
Shira L. Broschat

After reducing the dimensionality of our predictor set and calculating the effective factors, we used these factors to build a binary logistic regression model and applied a fast backward feature selection method. Because there are two response classes (effector and non-effector), binary logistic regression is a suitable analysis method. The input to the logistic function can be any real number, and its output takes a value between 0 and 1, representing the probability of being an effector. The form of the logistic function is given by Eq (1).
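
Eq (1) is not reproduced in this excerpt. For orientation, the standard form of the logistic function for a model with k factors is sketched below; the symbols p, x_i, and \beta_i are assumed here for illustration and may differ from the notation of the original equation.

```latex
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k)}}
```

Here p would be the estimated probability of being an effector, x_1, ..., x_k the factor values for a protein, \beta_0 the intercept, and \beta_1, ..., \beta_k the coefficients associated with the factors.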

For this step, we used Minitab software to build a logistic regression model for testing our calculated factors and to determine which ones were the most effective based on the fitted model. We used the factors as independent variables and constructed a logistic regression model for each of the four bacteria types. In addition, the Hosmer-Lemeshow test, a goodness-of-fit test, was used to evaluate how well the probabilities predicted by the model match the observed outcomes, that is, how well the model predicts the effectors. The test works by grouping the input dataset of effectors and non-effectors according to the estimated probability of being an effector. Most software groups the data into deciles, with 10 percent of the data in each group, which is the case for our work. The model is then used to predict whether the proteins in each group are effectors or non-effectors, and the percentage of expected and observed results that are in concordance is calculated.
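
The grouping step described above can be illustrated with a minimal Python sketch of the textbook Hosmer-Lemeshow statistic. This is not the Minitab procedure used in this work; the function name `hosmer_lemeshow` and its arguments are hypothetical, and the p-value uses the closed-form chi-square tail that holds only for an even number of degrees of freedom (for example, 10 groups give 8 degrees of freedom).

```python
import math


def hosmer_lemeshow(y, probs, groups=10):
    """Return (statistic, p_value) for observed labels y (0 or 1) and predicted probabilities."""
    if groups < 4 or groups % 2:
        raise ValueError("this sketch needs an even number of groups, at least 4")
    # Sort observations by predicted probability, then cut into equal-sized groups.
    order = sorted(range(len(y)), key=lambda i: probs[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        size = len(idx)
        observed = sum(y[i] for i in idx)      # observed effectors in the group
        expected = sum(probs[i] for i in idx)  # expected effectors in the group
        # Combined contribution of effectors and non-effectors in this group.
        stat += (observed - expected) ** 2 * size / (expected * (size - expected))
    # Chi-square upper tail with groups - 2 degrees of freedom (even, closed form).
    half = stat / 2.0
    terms = (groups - 2) // 2
    p_value = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(terms))
    return stat, p_value


# Toy input: 20 proteins, every predicted probability 0.5, half are effectors.
stat, p_value = hosmer_lemeshow([1, 0] * 10, [0.5] * 20)
print(stat, p_value)
```

A large p-value indicates no evidence of poor fit. The sketch assumes every predicted probability lies strictly between 0 and 1, so that no group has an expected count of zero.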

Considering the logistic function in Eq (1), we see that it associates a coefficient with each independent variable, and the variables with larger coefficients are more effective in the model. First, we built our logistic regression models such that there was no complete separation between effectors and non-effectors based on the factors, a situation that arises readily for small datasets. In this way we were able to discern the most informative factors and eliminate the least informative ones. We then rebuilt the logistic regression model and evaluated the effectiveness of the remaining factors. We continued this process as long as the concordance rate from the Hosmer-Lemeshow test remained acceptable, that is, greater than 90%. In this way, the set of factors working most effectively to predict effector proteins was selected.
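
The elimination loop described above can be sketched in plain Python as follows. This is a minimal illustration under several assumptions, not the Minitab workflow used here: the model is fitted by gradient descent with a small L2 penalty, "concordance" is taken to be the percentage of effector/non-effector pairs in which the effector receives the higher predicted probability, factors are assumed to be on comparable scales so that coefficient magnitudes can be compared, and all function names, factor names, and the synthetic data are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=300, l2=0.01):
    # Batch gradient descent; the small L2 penalty (an assumption of this
    # sketch) keeps coefficients finite if the classes happen to separate.
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for row, label in zip(X, y):
            z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))
            err = sigmoid(z) - label
            grad[0] += err
            for j, xj in enumerate(row):
                grad[j + 1] += err * xj
        w[0] -= lr * grad[0] / n
        for j in range(1, p + 1):
            w[j] -= lr * (grad[j] / n + l2 * w[j])
    return w


def predict(w, X):
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], row))) for row in X]


def concordance(y, probs):
    # Percentage of (effector, non-effector) pairs ranked correctly; ties do not count.
    pos = [p for p, label in zip(probs, y) if label == 1]
    neg = [p for p, label in zip(probs, y) if label == 0]
    hits = sum(1 for a in pos for b in neg if a > b)
    return 100.0 * hits / (len(pos) * len(neg))


def backward_select(X, y, names, threshold=90.0):
    keep = list(range(len(names)))
    while len(keep) > 1:
        current = [[row[j] for j in keep] for row in X]
        w = fit_logistic(current, y)
        # Candidate for removal: the factor with the smallest absolute coefficient.
        weakest = min(range(len(keep)), key=lambda k: abs(w[k + 1]))
        trial = keep[:weakest] + keep[weakest + 1:]
        reduced = [[row[j] for j in trial] for row in X]
        score = concordance(y, predict(fit_logistic(reduced, y), reduced))
        if score <= threshold:
            break  # dropping this factor would leave concordance at or below the threshold
        keep = trial
    return [names[j] for j in keep]


# Synthetic data: only the first of three factors carries signal.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [1 if row[0] + 0.3 * random.gauss(0, 1) > 0 else 0 for row in X]
selected = backward_select(X, y, ["factor_1", "factor_2", "factor_3"])
print(selected)
```

Refitting after each removal mirrors the rebuild-and-evaluate cycle in the text; the stopping rule keeps the last model whose concordance is still above 90%.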

In the final step, as discussed in the factor analysis section, we used the factor loadings to determine the set of original features represented by the selected factors. If we assume that each original feature is represented by the factor on which it has the greatest loading, we know the set of original features that each factor represents. In this way, we created the group of selected features for each of the four types of pathogens.
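
The mapping from selected factors back to original features can be sketched as below. The loading values and the feature and factor names are invented purely for illustration, and the use of the absolute value of the loading to define "greatest" is an assumption of this sketch.

```python
def features_for_factors(loadings, selected_factors):
    """Assign each feature to its highest-loading factor, then keep the selected factors."""
    assigned = {}
    for feature, row in loadings.items():
        # Factor on which this feature has the greatest absolute loading (assumed rule).
        best = max(row, key=lambda factor: abs(row[factor]))
        assigned.setdefault(best, []).append(feature)
    return {factor: assigned.get(factor, []) for factor in selected_factors}


# Hypothetical loadings: feature -> {factor: loading}.
toy_loadings = {
    "feature_a": {"factor_1": 0.82, "factor_2": 0.10},
    "feature_b": {"factor_1": -0.75, "factor_2": 0.30},
    "feature_c": {"factor_1": 0.05, "factor_2": 0.91},
}
selected_features = features_for_factors(toy_loadings, ["factor_1"])
print(selected_features)
```

With these toy values, only the features assigned to `factor_1` are returned, since `factor_2` is not among the selected factors.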
