To find genetic variants influencing individuals’ risks for topics of diseases, we perform GWAS using inferred topic weights as continuous traits (topic-GWAS) using PLINK2.39,73 Since topic weights are real numbers in the range of 0 to 1, the basic assumptions of linear regression do not hold. We apply a logit transformation on topic weights to address this issue before fitting the standard linear model for GWAS. We validate that using logit transformation gives better results than using rank based inverse normal transformation and using no transformation on topic weights. The validation is done by comparing the number of significant loci found by different methods, and the predictive performance of PRS for single codes based on topic-GWAS results (see below). For topic-GWAS, we only include common SNPs (SNPs with a minor allele frequency (MAF) larger than 0.01 in UKB) and individuals who self-report having British ancestry in the training dataset (343,006 individuals in total). Sex, age and the first ten principal components (PCs) of genomic variation are controlled for.
For comparison, we also perform GWAS (logistic regression) using the presence and absence of single ICD-10 codes as binary traits (single code GWAS). The inclusion criterion for individuals, SNPs and covariates are the same as topic-GWAS. In addition to ICD-10 codes, we also use terminal Phecodes mapped from the top 100 ICD-10 codes as traits for single code GWAS. Phecodes are defined by systematically grouping terminal ICD-10 codes into more applicable medical terms based on the judgements of clinicians and researchers,78 which reduces the granularity of terminal ICD-10 codes. Similar to ICD-10 codes, there is a hierarchical coding system for Phecodes. To map the ICD-10 codes used in the top-100 UKB dataset to phecodes, we firstly extract all terminal ICD-10 codes (on the fifth layer of the ICD-10 tree) that are children codes of the 100 level-4 ICD-10 codes, and then retrieve their corresponding Phecodes according to the Phecode map (https://phewascatalog.org/phecodes) In total, there are 296 terminal Phecodes mapped from the 100 ICD-10 codes.
Inflation in P-values are observed for the topic-GWAS results given by all three topic models. The inflation can either be resulted from true polygenicity of the traits (topic weights), or stratification in the population. To differentiate these two possibilities, we carry out LD score regression (LDSC)57 using the summary statistics of topic-GWAS for all topics. Pre-computed LD scores (based on 1000 Genomes European data) are downloaded and used in the analyses. The genomic control inflation factor λGC and the intercept of LDSC output by the algorithm are compared. A large λGC and small intercept for the same trait suggest true polygenicity causing the inflation in P-values, while large values for both λGC and intercept suggest stratification in the population.
To define genomic loci from significant SNPs (P<5×10−8) found by GWAS, we use the clumping function implemented in PLINK-1.9 and the threshold of r2>0.1 to group SNPs in linkage disequilibrium (LD). We define a locus to be an association for both topic-GWAS and single code GWAS as follows: the significant lead SNP found by one GWAS method can be clumped with a significant lead SNP found by the other method.
In addition to grouping disease codes via topic modelling, we also group disease codes completely following the medical ontologies (ICD-10 and Phecode systems). In other words, we use internal codes (such as blocks of categories of diseases and chapters of disease, corresponding to the nodes in the third and second layers of the ICD-10 coding system) of the two disease classification systems as binary traits for single code GWAS. For instance, if both disease codes A and B are under a common parent code C on the tree, then C will be used as the trait for GWAS, and individuals who are diagnosed with either A or B will be used as cases for the single code GWAS for code C. For the 100 ICD-10 codes there are 68 internal codes, and for the 296 Phecodes there are 136 internal codes.
To illustrate the potential reason for only a small proportion of single code associated loci being identified by topic-GWAS, we used a simple simulation. We first simulated five binary genotype variables for 20,000 individuals: . Then for each genotype variable, we simulated an associated disease variable: , where The simulation of the disease variable was repeated until logistic regression on the disease variable yielded a P-value less than 0.05 for the genotype variable. After simulating the five SNPs and their corresponding associated diseases, we combined the five diseases as one binary trait (individuals having any of the five diseases will have value 1 for the combined disease variable) and carried out logistic regression on the combined disease variable using the five SNPs as predictors. We repeated this simulation 1,000 times and calculated the proportion of single disease associated SNPs that showed significant association (P-value <0.05) with the combined disease variable.
In addition to topics inferred by treeLFA, topic-GWAS for flatLFA, LDA and GETM inferred topics are also performed. For LDA and GETM, only individuals with at least one diagnosed disease code are used as input for inference. For topic-GWAS, there are two options to deal with the individuals without any diagnosis. We can either exclude them or include them and give them small random weights for all disease topics. We experiment with both methods and find that excluding the completely healthy individuals results in a larger power for topic-GWAS.
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.