The dataset for this study was drawn from the UK Biobank, which was established as a major prospective study with significant involvement from the UK Medical Research Council and the Wellcome Trust [21] and has become an important open-access resource for medical researchers across the UK and worldwide. The subset used here contained 150,000 records in the training set and 50,000 records in the validation set, each record representing a distinct individual. The variables in the dataset were as follows (UK Biobank field number in brackets): Demographic: Sex (31), Age (33), Smoking (20116), Ethnicity (21000), Townsend Deprivation Index (189); Investigatory: Haemoglobin (30020), Glycated Haemoglobin (30750), Body Mass Index (21001), Weight (21002), Body Fat % (23099); Medical Diagnoses: Diabetes (2443), High Blood Pressure (6150), Heart Attack / Angina / Stroke (6150), Blood Clot / Emphysema / Lung Clot, Asthma, Hayfever / Rhinitis / Eczema (6152), Other Serious Condition (2473).
The condition of interest in this study was Type II diabetes, globally a leading and increasing cause of morbidity and mortality [51] and predicted to become the most prevalent condition in the UK Biobank cohort [21]. Diabetes was chosen in order to test the framework on a realistic medical problem with a selection of variables of ethical interest. It was expected that a diagnosis of diabetes, and by association a raised level of HbA1c, would be predictable from the data. This was known to be plausible from existing work [52] on diabetes prediction using UK Biobank data, which also influenced the variable selection. The dataset was used for the regression task of predicting the level of glycated haemoglobin (HbA1c) from the other variables, excluding the diabetes diagnosis itself.
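The regression set-up described above can be illustrated with a minimal sketch. It assumes each record is a plain dictionary; the key names `hba1c` and `diabetes` and the helper `split_features_target` are hypothetical and are not taken from the study's code. The point it shows is that the diabetes diagnosis is withheld from the predictors, so the model cannot read the outcome from a variable that largely encodes it.

```python
from typing import Dict, Tuple

# Hypothetical column names, chosen for illustration only.
TARGET = "hba1c"          # glycated haemoglobin, the regression target
EXCLUDED = {"diabetes"}   # diagnosis withheld from the predictors


def split_features_target(record: Dict[str, object]) -> Tuple[Dict[str, object], float]:
    """Separate one record into predictor variables and the HbA1c target."""
    features = {
        name: value
        for name, value in record.items()
        if name != TARGET and name not in EXCLUDED
    }
    return features, float(record[TARGET])


# Example with made-up values.
example = {"sex": "F", "age": 60, "bmi": 27.5, "diabetes": True, "hba1c": 48.0}
features, target = split_features_target(example)
```

Here `features` holds only sex, age and BMI, and `target` is the HbA1c value.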
To prepare the data, a small number of records with what appeared to be outlier values of HbA1c were removed. Some other variables of interest, including Income and Forced Expiratory Volume, were dropped owing to missing data. The ethnicity codes in the data were also grouped into broader categories for ease of illustration. The diagnoses of heart attack, angina and stroke were combined into one variable because of the format of the original data and their low prevalence, as were Blood Clot / Emphysema / Lung Clot and Hayfever / Rhinitis / Eczema for the same reasons. Any records with missing values were removed; the record counts given above refer to complete records.
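The preparation steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the study's actual pipeline: the column names, the HbA1c plausibility bounds, the ethnicity grouping, and the idea that the three cardiovascular diagnoses arrive as separate indicators are all hypothetical, since the text does not specify them.

```python
from typing import Dict, Optional

# Hypothetical grouping of detailed ethnicity labels into broader categories;
# the grouping actually used in the study is not given in the text.
ETHNICITY_GROUPS = {
    "British": "White",
    "Irish": "White",
    "Indian": "Asian",
    "Pakistani": "Asian",
    "Caribbean": "Black",
    "African": "Black",
}

# Hypothetical plausibility bounds for HbA1c; the study does not state
# the rule it used to identify outliers.
HBA1C_MIN, HBA1C_MAX = 15.0, 200.0

# Variables dropped because of missing data (hypothetical column names).
DROPPED_COLUMNS = ("income", "forced_expiratory_volume")


def prepare_record(record: Dict[str, object]) -> Optional[Dict[str, object]]:
    """Return a cleaned copy of one record, or None if it should be removed."""
    # Drop the variables excluded owing to missing data.
    cleaned = {k: v for k, v in record.items() if k not in DROPPED_COLUMNS}

    # Keep complete records only.
    if any(value is None for value in cleaned.values()):
        return None

    # Remove apparent HbA1c outliers.
    if not HBA1C_MIN <= cleaned["hba1c"] <= HBA1C_MAX:
        return None

    # Group ethnicity into broader categories.
    cleaned["ethnicity"] = ETHNICITY_GROUPS.get(cleaned["ethnicity"], "Other")

    # Combine heart attack, angina and stroke into a single indicator
    # (assumes they are available as three separate boolean columns).
    flags = [cleaned.pop(name) for name in ("heart_attack", "angina", "stroke")]
    cleaned["heart_attack_angina_stroke"] = any(flags)
    return cleaned
```

Applying `prepare_record` to every raw record and discarding the `None` results would leave only complete records, which is what the training and validation counts refer to.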
The UK Biobank project was approved by the National Research Ethics Service Committee North West-Haydock (REC reference: 11/NW/0382). Electronic signed consent was obtained from the participants.