Binary logistic regression predicts the odds of an event, defined as the probability that the event occurs divided by the probability that it does not. Its advantages are simplicity, fast training, and the wide use of log odds for investigating the relative risk associated with each predictor of the binary outcome. However, the model cannot handle missing data, and it cannot automatically and adaptively detect non-linear structure in the data.
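As a minimal sketch of the odds interpretation, the snippet below fits scikit-learn's LogisticRegression to synthetic data (the dataset and all variable names are hypothetical) and recovers odds and odds ratios from the fitted model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data: 4 predictors, binary outcome.
X, y = make_classification(n_samples=500, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# A predicted probability p yields odds p / (1 - p). Each coefficient is the
# change in log odds per unit increase in a predictor, so exp(coef) is an
# odds ratio quantifying that predictor's relative effect on the outcome.
odds_ratios = np.exp(model.coef_[0])
p = model.predict_proba(X[:1])[0, 1]
odds = p / (1 - p)
```

Note that `decision_function` returns exactly the log odds, so `np.log(odds)` matches it for the same observation.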
Random forest is an ensemble method that aims to enhance performance by combining many weak classifiers, such as decision trees. Given a training set, random forest first generates many bootstrap samples. A decision tree is then built on each bootstrap sample, with a randomly selected subset of predictors considered for splitting at each node. Finally, the predicted probability of the binary outcome is obtained by averaging the predicted probabilities from the fitted trees. Random forest models can be trained fairly quickly because the trees can be grown in parallel. In addition, unlike some other machine learning models, the injected randomness helps training avoid getting stuck in a local minimum; hence the model can be made more complex to improve prediction accuracy with little risk of overfitting.
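The bootstrap-plus-averaging procedure above can be sketched with scikit-learn's RandomForestClassifier (synthetic data; parameter choices are illustrative, not from the original protocol):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each tree is fit on a bootstrap sample and considers a random subset of
# predictors (max_features) at every split; n_jobs=-1 grows trees in parallel.
rf = RandomForestClassifier(
    n_estimators=100, max_features="sqrt", n_jobs=-1, random_state=0
).fit(X, y)

# The forest's predicted probability is the average of the per-tree
# predicted probabilities, as described in the text.
per_tree_avg = np.mean(
    [tree.predict_proba(X[:5])[:, 1] for tree in rf.estimators_], axis=0
)
forest_prob = rf.predict_proba(X[:5])[:, 1]
```

Comparing `per_tree_avg` with `forest_prob` confirms that the ensemble prediction is exactly the average over the fitted trees.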
A support vector machine (SVM) treats each data point as a vector in m-dimensional space (where m is the number of variables), with the value of each variable serving as a coordinate. It then differentiates the classes by identifying a separating hyper-plane. Finding a linear hyper-plane is straightforward when two classes are linearly separable; however, many problems are non-linear. The most significant benefit of SVMs is that they are not restricted to linear classifiers: kernel functions implicitly map the low-dimensional input space into a higher-dimensional space, making the algorithm far more flexible by allowing various types of non-linear decision boundaries.
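A small illustration of this point, using a synthetic concentric-circles dataset (hypothetical, chosen because no linear hyper-plane can separate it): a linear-kernel SVM fails, while an RBF kernel, which corresponds to an implicit higher-dimensional mapping, separates the classes easily.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
# The RBF kernel implicitly maps inputs into a higher-dimensional space,
# where a linear hyper-plane corresponds to a non-linear boundary in 2-D.
rbf_svm = SVC(kernel="rbf").fit(X, y)

linear_acc = linear_svm.score(X, y)
rbf_acc = rbf_svm.score(X, y)
```

The accuracy gap between the two kernels on this data is the flexibility the paragraph describes.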
XGBoost is an efficient and scalable implementation of the gradient boosting framework of J. Friedman (Friedman, 2002; Chen and Guestrin, 2016), and it is now a widely used and popular machine learning technique in the data science community. It is an ensemble technique that builds the model in a stage-wise manner: new models are added to correct the errors made by the previously trained models, sequentially, until no further improvement can be made. It is a highly flexible and versatile approach that handles most regression, classification, and ranking tasks, as well as customized objective functions.
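To illustrate the stage-wise construction without assuming the xgboost package is installed, the sketch below uses scikit-learn's GradientBoostingClassifier, an implementation of the same Friedman gradient boosting framework that XGBoost accelerates (the dataset and settings are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=600, random_state=0)

# Each stage fits a new tree to correct the errors of the ensemble so far.
gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# staged_predict yields the prediction after each stage, exposing the
# sequential, error-correcting construction described in the text.
staged_acc = [np.mean(pred == y) for pred in gb.staged_predict(X)]
```

Training accuracy after the final stage is at least that of the first stage, reflecting how each added model refines the ensemble.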