For each disease, we fitted a conditional logistic regression model, treating female sex as the exposure and the disease code as the outcome, and conditioning on examination center, year of participation, and year of birth. In this part of the study, we did not analyze diseases of the genital tract. As suggested previously (Li et al. 2018), we analyzed only ICD codes with more than 200 incident cases; this resulted in 301 nested case–control studies. The odds ratio (OR) derived from a nested case–control study is, in theory, largely representative of the relative risk derived from a cohort study (Goldstein and Langholz 1992). To exclude very weak associations and ensure a significant difference in relative risk between men and women (Jensen et al. 2017; Westergaard et al. 2019), diseases with an OR larger than 1.2 and a p value less than 0.05/301 (the Bonferroni-corrected threshold) were considered to be associated with the female sex.
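The selection rule above can be sketched in a few lines of Python. This is only an illustration: the disease labels, ORs, and p values below are invented placeholders rather than study results, and the function name `female_associated` is hypothetical.

```python
# Selection rule: OR > 1.2 and p < 0.05 / 301 (Bonferroni-corrected threshold).
N_TESTS = 301
ALPHA = 0.05
BONFERRONI_THRESHOLD = ALPHA / N_TESTS  # roughly 1.66e-4

def female_associated(results, min_or=1.2, threshold=BONFERRONI_THRESHOLD):
    """Return the codes whose OR and p value both pass the selection rule.

    `results` is an iterable of (code, odds_ratio, p_value) tuples.
    """
    return [code for code, odds_ratio, p in results
            if odds_ratio > min_or and p < threshold]

# Invented placeholder values, not results from the study.
example_results = [
    ("code_A", 3.10, 1e-40),  # passes both criteria
    ("code_B", 0.95, 1e-6),   # OR too small
    ("code_C", 1.25, 2e-4),   # p value above the corrected threshold
    ("code_D", 1.15, 1e-10),  # OR too small
]
selected = female_associated(example_results)
```

Note that a disease with a nominally small p value (such as `code_C` here) is still dropped if it does not clear the corrected threshold.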
Disease trajectory analysis was proposed to assess the strength and directionality of the association between two diseases (Jensen et al. 2014; Siggaard et al. 2020). In the first step, the association between two incident diseases (denoted D1 and D2) was tested among women after they joined the UK Biobank. Women diagnosed with either D1 or D2 earlier than three months after the date of participation were excluded from the analysis. Because we aimed to investigate female-specific disease trajectories, D1 and D2 each had to be either a disease identified as associated with the female sex in the above analysis or a disease of the female reproductive system. We first constructed the disease pairs (D1-D2) among women. As suggested by previous studies (Han et al. 2021; Siggaard et al. 2020; Yang et al. 2019), only disease pairs for which more than 50 women had both diseases diagnosed later than three months after cohort participation were included in the analysis, to ensure statistical power. The association between the two diseases was estimated using a case–control study design. We defined women with D2 as cases and randomly matched each of them with up to 10 participants (controls). Controls were alive and free of D2 at the time when D2 occurred in their matched case. Cases and controls were individually matched on age at participation, year of birth, and examination center. We then tested whether D2 was associated with D1 using conditional logistic regression models and calculated the OR to estimate the strength of the association for each disease pair.
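The control-sampling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Woman` record, its field names, and `sample_controls` are hypothetical, times are measured in years since baseline, and matching is simplified to year of birth and examination center (age at participation is omitted for brevity).

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Woman:
    pid: int
    birth_year: int
    center: str
    d2_time: Optional[float]  # time of D2 diagnosis, None if never diagnosed
    end_time: float           # death or end of follow-up

def sample_controls(case, cohort, k=10, rng=None):
    """Draw up to k controls from the risk set at the case's D2 time."""
    rng = rng or random.Random(0)
    t = case.d2_time
    risk_set = [
        w for w in cohort
        if w.pid != case.pid
        and w.birth_year == case.birth_year
        and w.center == case.center
        and w.end_time > t                          # still alive and under follow-up
        and (w.d2_time is None or w.d2_time > t)    # free of D2 at time t
    ]
    return rng.sample(risk_set, min(k, len(risk_set)))

# Invented toy cohort for illustration only.
cohort = [
    Woman(1, 1950, "A", 2.0, 5.0),    # the case
    Woman(2, 1950, "A", None, 10.0),  # eligible: never has D2
    Woman(3, 1950, "A", 1.0, 10.0),   # ineligible: already had D2
    Woman(4, 1950, "A", 3.0, 10.0),   # eligible: D2 occurs later
    Woman(5, 1950, "A", None, 1.5),   # ineligible: follow-up ended earlier
    Woman(6, 1950, "B", None, 10.0),  # ineligible: different center
]
controls = sample_controls(cohort[0], cohort)
control_ids = sorted(w.pid for w in controls)
```

In this sketch a woman who develops D2 later (such as participant 4) can still serve as a control for an earlier case, because eligibility is evaluated at the time of the case's diagnosis.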
In the second step, for each pair of diseases identified in step 1, we used binomial tests to assess the temporal directionality (D1 → D2) of the association, as recommended by Jensen et al. (2014). The binomial test for D1 → D2 tested whether, among patients diagnosed with both D1 and D2, the probability of D1 being diagnosed before D2 was significantly higher than 50%. D1 → D2 disease pairs with p values less than the Bonferroni-corrected threshold were then included in the next step of the analysis.
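As an illustration of the directionality test, the one-sided exact binomial p value can be computed directly from the binomial distribution. The sketch below assumes that patients diagnosed with D1 and D2 on the same date have already been excluded, which the text does not specify; the function name and the example counts are hypothetical.

```python
from math import comb

def binomial_test_greater(k, n, p=0.5):
    """One-sided exact p value: P(X >= k) for X ~ Binomial(n, p).

    n: number of patients diagnosed with both D1 and D2
    k: number of those patients in whom D1 preceded D2
    """
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Invented example: D1 precedes D2 in 80 of 100 patients with both diseases.
p_value = binomial_test_greater(80, 100)
```

A pair would be retained as D1 → D2 only if `p_value` fell below the Bonferroni-corrected threshold for the number of pairs tested.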
Disease pairs meeting both the strength (OR > 1.2) and directionality requirements of the association tests were combined into disease trajectories. To detect subgroups among the identified disease trajectories, we used the Louvain clustering algorithm to subdivide the network of disease trajectories into clusters. The Louvain algorithm, previously used for community identification in social network analysis, identifies regions of the neighbor graph that are more densely connected than the graph as a whole (Iliho and Saritha 2019).
To investigate the associations between preclinical biomarkers and the longitudinal patterns of multi-step disease trajectories, disease pairs meeting the directionality requirement of the association test were first combined into 3-line disease trajectories (e.g. D1 → D2 and D2 → D3 were combined into D1 → D2 → D3), requiring at least 10 patients to pass through each trajectory to avoid chance findings. Mediation analyses were then performed to test the potential causal relationship within the 3-line trajectories, treating D1 as the exposure, D3 as the outcome, and D2 as the mediator. We used the method suggested by VanderWeele (VanderWeele 2014), which estimates the overall effect of D1 on D3 in the presence of D2 and decomposes it into four components: the direct effect, the effect due to mediation only, the effect due to interaction only, and the effect due to both mediation and interaction. We also estimated the percentage of the total association (on the log-odds scale) between D1 and D3 that was mediated by D2. The 65 blood and urinary biomarkers and four physical examination indexes (body mass index (BMI), waist circumference, blood pressure, and heart rate), collectively denoted B0 and measured at baseline in the UK Biobank (Supplementary Material 1), were analyzed to reflect the potential preclinical status of the women. As indicated by the UK Biobank, the biomarkers were selected for their scientific relevance to the study of a wide range of diseases.
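The chaining of directional pairs into 3-line trajectories, with the minimum of 10 patients, can be sketched as follows. This is an illustrative sketch under assumptions that the text does not state: diagnoses must occur in strictly increasing time order, trajectories returning to D1 are dropped, and all function and variable names are hypothetical.

```python
def three_step_trajectories(pairs):
    """Chain directional pairs (D1, D2) and (D2, D3) into (D1, D2, D3)."""
    successors = {}
    for first, second in pairs:
        successors.setdefault(first, []).append(second)
    return [(d1, d2, d3)
            for d1, d2 in pairs
            for d3 in successors.get(d2, [])
            if d3 != d1]  # assumption: no trajectory loops back to D1

def count_followers(trajectory, diagnoses):
    """Count patients diagnosed with all three diseases in trajectory order.

    `diagnoses` maps patient id -> {disease: time of first diagnosis}.
    """
    count = 0
    for times in diagnoses.values():
        if all(d in times for d in trajectory):
            t1, t2, t3 = (times[d] for d in trajectory)
            if t1 < t2 < t3:
                count += 1
    return count

def keep_trajectories(pairs, diagnoses, min_patients=10):
    """Keep trajectories followed by at least `min_patients` patients."""
    return [t for t in three_step_trajectories(pairs)
            if count_followers(t, diagnoses) >= min_patients]

# Invented toy input for illustration only.
pairs = [("A", "B"), ("B", "C"), ("C", "D")]
candidates = three_step_trajectories(pairs)
toy_diagnoses = {
    "p1": {"A": 1.0, "B": 2.0, "C": 3.0},  # follows A -> B -> C
    "p2": {"A": 1.0, "B": 3.0, "C": 2.0},  # wrong order
    "p3": {"A": 1.0, "B": 2.0},            # never diagnosed with C
}
n_abc = count_followers(("A", "B", "C"), toy_diagnoses)
```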
To construct the biomarker and disease trajectories (B0 → D1 → D2 → D3) among women, the association between D1 and each biomarker was tested by considering D1 as the outcome and the biomarker (B0), measured in the blood or urine sample collected at baseline, as the exposure. The incidence density sampling procedure and matching criteria were the same as in the disease risk analysis. Most biomarkers were standardized to a mean of zero and a standard deviation of one. The exceptions were estradiol and rheumatoid factor in blood and microalbumin in urine, which were dichotomized at their limit of detection because more than 50% of the participants had values below it. Associations with D1 were tested using conditional logistic regression models. All biomarker and disease trajectories (B0 → D1 → D2 → D3) were mapped out using Cytoscape 3.5 (Shannon et al. 2003).
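The two biomarker transformations described above can be illustrated with a short sketch. It assumes the sample standard deviation is used for standardization and that a value equal to the limit of detection counts as detected; neither detail is given in the text, and the function names and values are hypothetical.

```python
from statistics import mean, stdev

def standardize(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def dichotomize(values, limit_of_detection):
    """Code each value as 1 (at or above the limit of detection) or 0 (below)."""
    return [1 if v >= limit_of_detection else 0 for v in values]

# Invented placeholder measurements, not UK Biobank data.
z_scores = standardize([1.0, 2.0, 3.0])
detected = dichotomize([0.10, 0.50, 0.05], limit_of_detection=0.10)
```

After standardization, the OR from the conditional logistic regression is interpreted per one standard deviation increase in the biomarker; for the dichotomized biomarkers it compares detectable with undetectable levels.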
A flow chart of the methodology used in the analyses is shown in Supplementary Fig. 1. Sensitivity analyses included starting the follow-up 1 year after cohort participation and stratifying the analyses by age (< 55 and > 55 years).
Statistical analyses were performed using SAS (version 9.4; SAS Institute Inc, Cary, NC, USA), and R software (version 3.6.3; R Foundation for Statistical Computing, Vienna, Austria).