Since DNA methylation and RNAseq data are informative for age, sex, and white blood cell composition [87–90], we used the data to impute these variables. Missing observations were imputed separately for the RNAseq and DNA methylation data because there is incomplete overlap between the datasets. Missing observations in the measured white blood cell counts (WBCC) were imputed using the R package pls, adjusting for reported age and sex, as described earlier (https://molepi.github.io/DNAmArray_workflow/05_Predict.html) [20]. For missing age and sex measurements, we compared the performance of the elastic net, LASSO, ridge, and pls methods. To evaluate the performance of these models, the data was randomly split into a train set (2/3) and a test set (1/3). This procedure was repeated 25 times, each time calculating the accuracy in the test set (mean squared error for age and F1-score for sex). The above algorithm was performed using varying numbers of input variables (50 to 10,000), where the input variables were selected based on their correlation with the outcome. The model and number of input variables that resulted in the best average accuracy in the test sets were selected to impute missing data. The average correlation between predicted and reported age in the tests sets was 0.98 for the DNA methylation data and 0.92 for the RNAseq data. Sex was almost perfectly predicted (accuracy ≈ 0.995) in both DNA methylation and RNAseq data.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.
Tips for asking effective questions
+ Description
Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.