Statistical analysis

Rui Luo
Nandini Mukherjee
Su Chen
Yu Jiang
S Hasan Arshad
John W Holloway
Anna Hedman
Olena Gruzieva
Ellika Andolf
Goran Pershagen
Catarina Almqvist
Wilfried JJ Karmaus

Statistical and biological assessments followed 5 steps (Figure 1). First, in the discovery cohort, the statistical analyses started with screening for CpG sites potentially associated with gestational age at birth, using the training-testing screening (ttScreening) package (v1.5).42 The ttScreening approach can detect more true-positive CpGs than traditional methods that control for multiple testing through false discovery rates (FDR) or Bonferroni-based corrections. In simulations, the sensitivity of ttScreening was comparable with that of the FDR-based method, but its specificity was higher. Compared with the Bonferroni-based method, ttScreening showed better sensitivity and comparable specificity. ttScreening uses 100 randomly selected training and testing sub-samples to estimate and test the effects of the primary variable.42 The selection proportion of a CpG indicates how often, across the 100 iterations, the CpG reached statistical significance in both the training and the testing sub-sample. To account for skewed distributions when relating gestational age at birth to the methylation β values, the β values were logit-transformed. In the screening process, whose purpose is the detection of associations, the CpGs were the dependent variables and gestational age at birth was the independent variable. Potential confounding by leukocyte cell composition was controlled for during ttScreening. CpG sites that met the 50% cut-off in selection proportion were considered important.42
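The two numerical ingredients of this screening step, the logit transformation of β values and the 50% selection-proportion cut-off, can be illustrated with a minimal Python sketch. This is not the ttScreening package itself (which is an R package) and it does not fit any regression models. The function names, the clipping constant `eps`, the use of the natural logarithm, and the reading of "met the cut-off" as "at least 0.5" are assumptions made for illustration.

```python
import math


def logit_beta(beta, eps=1e-6):
    """Logit-transform a methylation beta value (a proportion in [0, 1]).

    eps is an assumed clipping constant that keeps the logarithm finite
    when beta is exactly 0 or 1; the natural log is also an assumption.
    """
    clipped = min(max(beta, eps), 1.0 - eps)
    return math.log(clipped / (1.0 - clipped))


def selection_proportion(train_significant, test_significant):
    """Fraction of iterations in which a CpG was significant in BOTH
    the training and the testing sub-sample of that iteration."""
    if not train_significant or len(train_significant) != len(test_significant):
        raise ValueError("need two non-empty lists of equal length")
    both = sum(1 for tr, te in zip(train_significant, test_significant) if tr and te)
    return both / len(train_significant)


def passes_screening(train_significant, test_significant, cutoff=0.5):
    """Apply the selection-proportion cut-off (assumed to mean >= cutoff)."""
    return selection_proportion(train_significant, test_significant) >= cutoff


# Hypothetical significance flags for one CpG over 100 iterations.
train_flags = [True] * 60 + [False] * 40
test_flags = [True] * 100
cpg_is_selected = passes_screening(train_flags, test_flags)
```

In this made-up example the CpG is significant in both sub-samples in 60 of 100 iterations, so its selection proportion is 0.6 and it would pass a 50% cut-off.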

Figure 1. Study flowchart of the statistical and biological assessments in the course of the study.

Second, following the screening, the study concentrated on biological pathways to explain the function of the identified CpGs. Functional enrichment analysis was conducted for the genes of the discovered CpGs, which were obtained from the methylation annotation file (Infinium MethylationEPIC v1.0 B4 Manifest File). For CpGs whose gene was not documented in the manifest, the nearest gene names were identified using SNIPPER (https://csg.sph.umich.edu/boehnke/snipper/)43 and the University of California Santa Cruz (UCSC) Genome Browser (https://genome.ucsc.edu/).44 The chromosome number and map position of each CpG were queried (using Human GRCh37/hg19), and the gene nearest to the CpG site was selected. Once all genes were identified, the list of genes was entered into ToppFUN (https://toppgene.cchmc.org/)45 to identify biological pathways related to these genes. Significant pathways (P value adjusted for multiple testing ⩽.05) were selected for the next steps.
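The nearest-gene lookup described above can be sketched as a simple distance comparison on one chromosome. This is only an illustration of the idea, not how SNIPPER or the UCSC Genome Browser work internally. The gene names and coordinates below are hypothetical placeholders, and treating a CpG inside a gene body as distance 0 is an assumption.

```python
def distance_to_gene(position, start, end):
    """Distance in base pairs from a CpG position to a gene interval.

    A CpG located inside the gene body is assigned distance 0 (an assumption).
    """
    if start <= position <= end:
        return 0
    return min(abs(position - start), abs(position - end))


def nearest_gene(cpg_chrom, cpg_pos, genes):
    """Return the name of the gene closest to the CpG on the same chromosome,
    or None if no gene on that chromosome is listed."""
    candidates = [g for g in genes if g["chrom"] == cpg_chrom]
    if not candidates:
        return None
    best = min(
        candidates,
        key=lambda g: distance_to_gene(cpg_pos, g["start"], g["end"]),
    )
    return best["name"]


# Hypothetical gene annotations (placeholder names and coordinates).
example_genes = [
    {"name": "GENE_A", "chrom": "1", "start": 1000, "end": 2000},
    {"name": "GENE_B", "chrom": "1", "start": 5000, "end": 6000},
    {"name": "GENE_C", "chrom": "2", "start": 1500, "end": 2500},
]

closest = nearest_gene("1", 2600, example_genes)
```

Here the CpG at position 2600 on chromosome 1 lies 600 bp from GENE_A and 2400 bp from GENE_B, so GENE_A is returned.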

Third, to test whether the biological pathway-related CpGs were associated with gestational age at birth after adjusting for confounders, Cox proportional hazards models were applied and hazard ratios (HRs) were estimated. In this step, to estimate the effect of each CpG on the duration of gestation, the CpGs were used as independent variables and gestational age at birth in weeks as the dependent variable. We adjusted for the proportions of leukocyte cell types, maternal smoking during pregnancy, maternal age at conception, paternal age at conception, and paternal smoking. To minimize false-positive findings, the P values were adjusted for FDR.46 To demonstrate the associations graphically, we chose the CpG site with the largest HR and plotted Kaplan-Meier curves of gestational age at birth by DNAm level. For the figure, DNAm was dichotomized into 2 groups at its median.
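The FDR adjustment of P values mentioned in this step can be sketched as follows, assuming the Benjamini-Hochberg procedure; the text cites a reference for the FDR adjustment but does not name the procedure, so this choice is an assumption. The function name and the example P values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order.

    Each P value is multiplied by m / rank (rank 1 = smallest), and a
    running minimum taken from the largest rank downwards keeps the
    adjusted values monotone and capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values for four CpGs.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = bh_adjust(raw_p)
significant = [p <= 0.05 for p in adjusted_p]
```

With these made-up inputs the adjusted values are approximately 0.02, 0.04, 0.04, and 0.02, so all four would remain significant at the .05 level.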

Fourth, to replicate our results, Spearman’s correlation analyses were performed between DNAm levels of the pathway-related CpG sites and gestational age at birth, both in our cohort and in the replication dataset from the Born into Life Study (n = 15). The 95% confidence intervals (CIs) of the IoWBC Spearman’s correlation coefficients were obtained, and we examined whether the correlation coefficients of the replication cohort fell within the respective 95% CIs. To better demonstrate the correlation between gestational age at birth and the methylation level of pathway-related CpG sites in both the IoWBC and the Born into Life Study, graphs were plotted using log-transformed gestational age at birth.
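The replication check described above can be sketched in a few lines. The text does not state how the 95% CIs were computed; this sketch assumes a Fisher z-transformation with standard error 1/sqrt(n - 3), which is a common approximation and may differ from what the authors used. All function names are hypothetical.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fisher_ci(rho, n, z_crit=1.959964):
    """Approximate 95% CI via Fisher's z (assumed method).

    Requires n > 3 and -1 < rho < 1, otherwise atanh or the
    standard error is undefined.
    """
    z = math.atanh(rho)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def within_ci(rho_replication, ci):
    """True if the replication coefficient lies inside the discovery CI."""
    lower, upper = ci
    return lower <= rho_replication <= upper
```

For example, `within_ci(0.4, fisher_ci(0.5, 28))` compares a hypothetical replication coefficient of 0.4 against the interval around a hypothetical discovery coefficient of 0.5 from 28 samples.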

Fifth and finally, to explore the biological implications of the pathway-related CpGs, correlations between offspring methylation levels at specific CpGs and expression of the corresponding genes in the offspring’s umbilical cord blood were estimated in IoWBC. In addition, the correlations were tested in a subgroup of participants with available paternal DNAm data, transcript data, and DNAm data for the offspring’s umbilical cord blood.

For all analyses, a P value of ⩽.05 was considered statistically significant. Statistical analyses were performed with R 3.4.4 and SAS 9.4 (SAS Institute, Cary, NC).
