XGBoost algorithm

Dalal Hammoudi Halat
Abdel-Salam G. Abdel-Salam
Ahmed Bensaid
Abderrezzaq Soltani
Lama Alsarraj
Roua Dalli
Ahmed Malki

XGBoost is a highly regarded machine learning algorithm that excels at both regression and classification problems. Its optimized, distributed gradient-boosting framework offers exceptional efficiency, flexibility, and portability. In the current study, we deployed XGBoost to examine the variables affecting student retention, progression, and graduation within our cohort of health major students.

The initial phase involved gathering an extensive dataset for the students. This dataset comprised demographic details (gender and nationality) and key academic performance indicators (student college, high school GPA, first achieved cumulative GPA, and whether the student's first year in the health major was taken as a common year). These factors were selected for their potential impact on student outcomes. The target variable in the dataset was whether a student retained, progressed, or graduated within a designated timeframe.
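The preparation step can be sketched as below. This is a minimal illustration with hypothetical records and column names (`hs_gpa`, `common_first_year`, etc. are assumptions, not the study's actual schema): categorical demographics are one-hot encoded and the outcome label is integer-encoded so the data is ready for a tree ensemble.

```python
import pandas as pd

# Hypothetical sample records; the column names and values are illustrative
# assumptions, not the study's real cohort data.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "nationality": ["QA", "EG", "QA", "JO"],
    "college": ["Pharmacy", "Medicine", "Pharmacy", "Dental"],
    "hs_gpa": [3.6, 3.1, 3.9, 2.8],            # high school GPA
    "first_cum_gpa": [3.2, 2.5, 3.7, 2.1],     # first achieved cumulative GPA
    "common_first_year": [1, 0, 1, 1],         # 1 = first year taken as common year
    "outcome": ["progressed", "retained", "graduated", "retained"],  # target
})

# One-hot encode the categorical predictors; numeric features pass through as-is.
X = pd.get_dummies(df.drop(columns="outcome"),
                   columns=["gender", "nationality", "college"])

# Integer-encode the three-class target (retained / progressed / graduated).
y = df["outcome"].astype("category").cat.codes
```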

The collected student data underwent a process of preparation and feature engineering, making it ready for analysis. Subsequently, we leveraged XGBoost to construct a predictive model. This algorithm builds a predictive ensemble model by iteratively combining weak decision tree models, a process grounded in the principle of boosting. Each successive decision tree in the ensemble is designed to rectify the inaccuracies of its predecessor. By examining the individual decision trees and their interplay, XGBoost captures the intricate relationships between the initial predictors and student outcomes.

During the model training phase, XGBoost monitors each feature's usage frequency in pivotal decisions across all decision trees. The algorithm generates a 'feature importance score' for each predictor by aggregating these statistics. This score signifies the relative contribution of a feature to the model's overall predictive strength. It is computed by tallying the total gain of each feature across all the decision trees in the ensemble, where 'gain' represents the improvement in the model's objective function achieved by splitting the data on that feature. By examining the feature importance scores, we identified the most impactful predictors in the model. Features with higher scores exerted a more pronounced influence on student retention, progression, or graduation in health majors, highlighting the critical factors that drive student outcomes. In our experiment, we used a grid search over parameters with cross-validation to ensure the best training and validation performance. We also set the parameter related to the class distribution. This parameter manages imbalance in classification problems where the class distribution is skewed: it acts as a scale factor applied to the positive class in binary classification, and its purpose is to give more emphasis to the minority class.
