XGBoost algorithm

Dalal Hammoudi Halat
Abdel-Salam G. Abdel-Salam
Ahmed Bensaid
Abderrezzaq Soltani
Lama Alsarraj
Roua Dalli
Ahmed Malki

XGBoost is a highly regarded machine learning algorithm that excels at both regression and classification problems. Its optimized, distributed gradient-boosting framework offers exceptional efficiency, flexibility, and portability. In the current study, we deployed XGBoost to examine the variables affecting student retention, progression, and graduation within our cohort of health major students.

The initial phase involved gathering an extensive dataset for the students. This dataset comprised demographic details (gender and nationality) and key academic performance indicators (student college, high school GPA, first achieved cumulative GPA, and whether the student's first year in the health major was taken as a common year). These factors were selected for their potential impact on student outcomes. The target variable in the dataset was whether a student retained, progressed, or graduated within a designated timeframe.
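The preparation step can be sketched as below. This is a minimal illustration with hypothetical records and column names (`hs_gpa`, `common_first_year`, etc. are assumptions, not the study's actual schema): categorical demographics are one-hot encoded and the outcome label is integer-encoded so the data is ready for a tree ensemble.

```python
import pandas as pd

# Hypothetical sample records; the column names and values are illustrative
# assumptions, not the study's real cohort data.
df = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "nationality": ["QA", "EG", "QA", "JO"],
    "college": ["Pharmacy", "Medicine", "Pharmacy", "Dental"],
    "hs_gpa": [3.6, 3.1, 3.9, 2.8],            # high school GPA
    "first_cum_gpa": [3.2, 2.5, 3.7, 2.1],     # first achieved cumulative GPA
    "common_first_year": [1, 0, 1, 1],         # 1 = first year taken as common year
    "outcome": ["progressed", "retained", "graduated", "retained"],  # target
})

# One-hot encode the categorical predictors; numeric features pass through as-is.
X = pd.get_dummies(df.drop(columns="outcome"),
                   columns=["gender", "nationality", "college"])

# Integer-encode the three-class target (retained / progressed / graduated).
y = df["outcome"].astype("category").cat.codes
```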

The collected student data underwent a process of preparation and feature engineering, making it ready for analysis. Subsequently, we leveraged XGBoost to construct a predictive model. This algorithm builds a predictive ensemble model by iteratively combining weak decision tree models, a process grounded in the principle of boosting. Each successive decision tree in the ensemble is designed to rectify the inaccuracies of its predecessor. By examining the individual decision trees and their interplay, XGBoost captures the intricate relationships between the initial predictors and student outcomes.

During the model training phase, XGBoost monitors each feature's usage frequency in pivotal decisions across all decision trees. The algorithm generates a 'feature importance score' for each predictor by aggregating these statistics. This score signifies the relative contribution of a feature to the model's overall predictive strength. It is computed by tallying the total gain of each feature across all the decision trees in the ensemble, where 'gain' represents the improvement in the model's objective function achieved by splitting the data on that feature. By examining the feature importance scores, we identified the most impactful predictors in the model. Features with higher scores exerted a more pronounced influence on student retention, progression, or graduation in health majors, highlighting the critical factors that drive student outcomes. In our experiment, we used a grid search over parameters with cross-validation to ensure the best training and validation performance. We also set the parameter related to the class distribution. This parameter manages imbalance in classification problems where the class distribution is skewed: it acts as a scale factor applied to the positive class in binary classification, and its purpose is to give more emphasis to the minority class.
