Imputation

Paul J. Hop; René Luijk; Lucia Daxinger; Maarten van Iterson; Koen F. Dekkers; Rick Jansen; Joyce B. J. van Meurs; Peter A. C. ’t Hoen; M. Arfan Ikram; Marleen M. J. van Greevenbroek; Dorret I. Boomsma; P. Eline Slagboom; Jan H. Veldink; Erik W. van Zwet; Bastiaan T. Heijmans


This method is extracted from research article: Genome Biol, Aug 2020

Genome-wide identification of genes regulating DNA methylation using genetic anchors for causal inference

DOI: 10.1186/s13059-020-02114-z


Because DNA methylation and RNAseq data are informative for age, sex, and white blood cell composition [87–90], we used these data to impute missing values of those variables. Missing observations were imputed separately for the RNAseq and DNA methylation data because the two datasets overlap only partially. Missing observations in the measured white blood cell counts (WBCC) were imputed using the R package pls, adjusting for reported age and sex, as described earlier (https://molepi.github.io/DNAmArray_workflow/05_Predict.html) [20]. For missing age and sex measurements, we compared the performance of the elastic net, LASSO, ridge, and pls (partial least squares) methods. To evaluate these models, the data were randomly split into a training set (2/3) and a test set (1/3). This split was repeated 25 times, each time calculating performance in the test set (mean squared error for age and F₁-score for sex). The procedure was run with varying numbers of input variables (50 to 10,000), where the input variables were selected based on their correlation with the outcome. The model and number of input variables that gave the best average test-set performance were used to impute missing data. The average correlation between predicted and reported age in the test sets was 0.98 for the DNA methylation data and 0.92 for the RNAseq data. Sex was almost perfectly predicted (accuracy ≈ 0.995) in both the DNA methylation and RNAseq data.
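The evaluation scheme described above (repeated random 2/3 to 1/3 splits, correlation-based selection of input variables, and a search over the number of inputs) can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' R code. Several parts are assumptions made for the sketch: the function names are hypothetical, `fit_signed_mean` is a deliberately simple stand-in for the elastic net, LASSO, ridge, and pls models, variables are selected within each training set (the text does not say where selection was done), and the data are synthetic.

```python
import random
import statistics


def pearson(a, b):
    """Pearson correlation; returns 0.0 if either input is constant."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5 if saa > 0 and sbb > 0 else 0.0


def mean_squared_error(y_true, y_pred):
    """Metric named in the text for age."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Metric named in the text for sex (binary labels)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def top_correlated(X, y, k):
    """Indices of the k columns of X with the largest |r| with y."""
    strength = [abs(pearson([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(strength)), key=lambda j: strength[j], reverse=True)[:k]


def fit_signed_mean(X_tr, y_tr):
    """Hypothetical stand-in model: least-squares line on the sign-aligned
    mean of the selected features. Swap in a real learner here."""
    p = len(X_tr[0])
    signs = [1.0 if pearson([r[j] for r in X_tr], y_tr) >= 0 else -1.0
             for j in range(p)]

    def score(row):
        return sum(s * v for s, v in zip(signs, row)) / p

    z = [score(r) for r in X_tr]
    mz, my = statistics.fmean(z), statistics.fmean(y_tr)
    szz = sum((v - mz) ** 2 for v in z)
    slope = sum((v - mz) * (t - my) for v, t in zip(z, y_tr)) / szz if szz > 0 else 0.0
    intercept = my - slope * mz
    return lambda X_new: [intercept + slope * score(r) for r in X_new]


def evaluate(X, y, fit, metric, k, n_repeats=25, train_frac=2 / 3, seed=0):
    """Average test-set metric over repeated random train/test splits."""
    rng = random.Random(seed)
    n = len(y)
    scores = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        cut = round(train_frac * n)
        train, test = idx[:cut], idx[cut:]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        # Assumption: select variables on the training part only.
        cols = top_correlated(X_tr, y_tr, k)
        predict = fit([[row[j] for j in cols] for row in X_tr], y_tr)
        y_hat = predict([[X[i][j] for j in cols] for i in test])
        scores.append(metric([y[i] for i in test], y_hat))
    return statistics.mean(scores)


def select_k(X, y, fit, metric, ks, lower_is_better=True):
    """Pick the number of input variables with the best average score."""
    results = {k: evaluate(X, y, fit, metric, k) for k in ks}
    pick = min if lower_is_better else max
    return pick(results, key=results.get), results


# Synthetic demo (made-up data, not the study's): 60 samples, 20 features,
# of which the first 5 carry signal for a made-up "age".
rng = random.Random(1)
age = [rng.uniform(20, 80) for _ in range(60)]
X = [[a / 10 + rng.gauss(0, 1) if j < 5 else rng.gauss(0, 1) for j in range(20)]
     for a in age]
best_k, mse_by_k = select_k(X, age, fit_signed_mean, mean_squared_error, ks=[1, 5, 20])
print(best_k, mse_by_k)
```

The grid `ks=[1, 5, 20]` is scaled down for the toy data; the text reports a range of 50 to 10,000 input variables. For a binary outcome such as sex, the same loop could be run with a classifier passed as `fit`, `f1_score` as `metric`, and `lower_is_better=False`.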

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
