Method details

EG Eugene J. Gardner
KK Katherine A. Kentistou
SS Stasa Stankovic
SL Samuel Lockhart
EW Eleanor Wheeler
FD Felix R. Day
NK Nicola D. Kerrison
NW Nicholas J. Wareham
CL Claudia Langenberg
SO Stephen O'Rahilly
KO Ken K. Ong
JP John R.B. Perry
request Request a Protocol
ask Ask a question
Favorite

To conduct rare variant burden analyses outlined in this publication, we queried ES data for 454,787 individuals provided by the UKBB study.5 Individuals were excluded from further analysis if they had excess heterozygosity, autosomal variant missingness on genotyping arrays ≥5%, or were not included in the subset of phased samples as defined in Bycroft et al.60 We further excluded all study participants who were not of broadly European genetic ancestry, leaving a total of 420,162 individuals for further analysis. Additional individuals were subsequently excluded when assessing specific phenotypes due to the incomplete nature of survey data for somephenotypes (Table S5).

To perform variant quality control and annotation, we utilised the UKBB Research Analysis Platform (RAP; https://ukbiobank.dnanexus.com/). The RAP is a cloud-based compute environment which provides a central data repository for UKBB ES and phenotypic data. Using bespoke applets designed for the RAP, we performed additional quality control of ES data beyond that already documented in Backman et al.5 Using provided population-level Variant Call Format (VCF) files, we first split and left-corrected multi-allelic variants into separate alleles using ‘bcftools norm’.61 Next, we performed genotype-level filtering using ‘bcftools filter’ separately for Single Nucleotide Variants (SNVs) and Insertions/Deletions (InDels) using a missingness-based approach. With this approach, SNV genotypes with depth <7 and genotype quality <20 or InDel genotypes with a depth <10 and genotype quality <20 were set to missing (i.e. ./.). We further tested for an expected alternate allele contribution of 50% for heterozygous SNVs using a binomial test; SNV genotypes with a binomial test p. value ≤ 1 × 10−3 were set to missing. Following genotype-level filtering we recalculated the proportion of individuals with a missing genotype for each variant and filtered all variants with a missingness value >50%.

We next annotated variants using the ENSEMBL Variant Effect Predictor (VEP) v10462 with the ‘--everything’ flag and plugins for REVEL,15 CADD,63 and LOFTEE64 enabled. For each variant, we prioritised a single ENSEMBL transcript based on whether or not the annotated transcript was protein-coding, MANE select v0.97, or the VEP Canonical transcript, respectively. Individual consequence for each variant was prioritised based on severity as defined by VEP. Following annotation, we grouped stop gained, frameshift, splice acceptor, and splice donor variants into a single Protein Truncating Variant (PTV) category. Missense and synonymous variant consequences are identical to those defined by VEP. Only autosomal or chrX variants within ENSEMBL protein-coding transcripts and within transcripts included on the UKBB ES assay were retained for subsequent burden testing.

To perform rare variant burden tests using filtered and annotated ES data, we employed a custom implementation of BOLT-LMM v2.3.617 for the RAP. BOLT-LMM expects two primary inputs: i) a set of genotypes with minor allele count >100 derived from genotyping arrays to construct a null model and ii) a larger set of imputed variants to perform association tests. For the former, we queried genotyping data available on the RAP and restricted to an identical set of individuals used for rare variant association tests. For the latter, and as BOLT-LMM expects imputed genotyping data as input rather than per-gene carrier status, we created dummy genotype files where each variant represents one gene and individuals with a qualifying variant within that gene are coded as heterozygous, regardless of the number of variants that individual has in that gene. To test a range of variant annotation categories across the allele frequency spectrum, we created dummy genotype files for minor allele frequency <0.1% and singleton high confidence PTVs as defined by LOFTEE, missense variants with REVEL ≥0.5, missense variants with REVEL ≥0.7, and synonymous variants. For each phenotype tested, BOLT-LMM was then run with default parameters other than the inclusion of the ‘lmmInfOnly’ flag. When exploring the role of rare variants in the IGF-1/GH axis and to incorporate less deleterious missense variants, we also used an additional set of variant annotations which combined missense variants with CADD ≥25 and high confidence PTVs (i.e. Damaging; Table S3). To derive association statistics for individual variants, we also provided all 26,657,229 individual markers regardless of filtering status as input to BOLT-LMM. All tested phenotypes were run as continuous traits corrected by age, age,2 sex, the first ten genetic principal components as calculated in Bycroft et al.,60 and study participant ES batch as a categorical covariate (either 50 k, 200 k, or 450 k). For phenotype definitions used in this study, including our T2D definition adapted from Eastwood et al.,65 please refer to Table S5. Only the first instance (initial visit) was used for generating all phenotype definitions unless specifically noted in Table S5.

To provide an orthogonal approach to validate our BOLT-LMM results, we also performed per-gene burden tests with STAAR16 and a generalised linear model as implemented in the python package ‘statsmodels’.66 To run STAAR, we created a custom Python and R workflow on the RAP. VCF files were first converted into a sparse matrix suitable for use with the R package ‘Matrix’ using ‘bcftools query’. Using the ‘STAAR’ R package, we first ran a null model with identical coefficients to BOLT-LMM and a sparse relatedness matrix with a relatedness coefficient cutoff of 0.125 as described by Bycroft et al.60 We next used the function ‘STAAR’ to test all protein-coding transcripts as outlined above. To run generalised linear models, we used a three step process. First, we ran a null model with all dependent variables as continuous traits, corrected for control covariates identical to those included in BOLT-LMM. Next, using the residuals of this null model, we performed initial regressions on carrier status to obtain a preliminary p. value. Finally, for individual genes that passed a lenient p. value threshold of <1 × 10−4, we recalculated a full model to obtain exact test statistics with family set to ‘binomial’ or ‘gaussian’ if the trait was binary or continuous, respectively. Generalised linear models utilised identical input to BOLT-LMM converted to a sparse matrix.

To conduct heterogeneity tests between two calculated odds ratios, we implemented a Z test as sample sizes were sufficiently large.67 This approach was used to determine if missense variants within the IGF1R protein kinase domain have a stronger effect on T2D risk than those elsewhere in IGF1R. A Z-score was calculated by dividing the difference in the coefficients by the standard error of the difference, as shown below:

Z = (Beta from model 1) – (Beta from model 2)/((SE of Beta from model 1)2 + (SE of Beta from model 2)2)1/2

P-values were estimated for Z scores from a normal distribution.

Common variant associations at the identified genes were queried using the T2D Knowledge Portal (https://t2d.hugeamp.org) and the Open Targets Genetics platform (https://genetics.opentargets.org/).68 Trait associations from the T2D Knowledge Portal are presented in Table S2 and were only included if the paired gene was assigned as the nearest gene to the association signal as a crude proxy for causality. Accompanying HuGE scores were extracted for the highest-scoring glycaemic common variants associations. Locus2Gene scores based on data from Vujkovic et al.2 were extracted from the Open Targets Genetics platform and are presented in Table S2. For the IGF1R locus follow-up, we used sentinel SNP information for Vujkovic et al.2 and summary statistics from the recent fasting glucose MAGIC meta-analysis21 and circulating IGF-1 levels GWAS. eQTL data was accessed through GTEx v822. Effect estimates in the text have been aligned toward the T2D/glucose increasing alleles, using LD information from LDlink.22 Colocalization between the IGF-1 and fasting glucose GWASs was ascertained using coloc v5.1.069 (Figure 3), where posterior probability for H3 (i.e. PP3) is the probability of two independent causal variants, while H4 is the probability of a single, shared causal variant. For this, variants within a 500 kb window of IGF1R that were common between the two studies were used. Regions in Figure 3 were plotted using LocusZoom.70

To examine the likelihood of a causal effect of IGF-1 on the risk of T2D, we applied Mendelian randomisation (MR) analysis. In this approach, genetic variants that are significantly associated with an exposure of interest are used as instrumental variables (IVs) to test the causality of that exposure on the outcome of interest. For a genetic variant to be a reliable instrument, the following assumptions should be met: (1) the genetic instrument is associated with the exposure of interest, (2) the genetic instrument should not be associated with any other competing risk factor that is a confounder, and (3) the genetic instrument should not be associated with the outcome, except via the causal pathway that includes the exposure of interest.71 As IVs, we used the 831 IGF-1 genome-wide significant signals reported in a recent GWAS on IGF-1.24 As our outcome data, we selected the largest publicly available independent T2D dataset available in 893,130 European genetic ancestry individuals (9% cases) from Mahajan et al.31 If a signal was not present in the outcome GWAS, we searched the UKBB white European dataset for proxies (within 1 Mb and r2 > 0.5) and chose the variant with the highest r2 value, which left 784 independent markers for MR analysis. Genotypes at all variants were aligned to designate the IGF-1-increasing alleles as the effect alleles.

To conduct our MR analysis, we used the inverse-variance weighted (IVW) model as the primary model as it offers the most statistical power72; however, as it does not correct for heterogeneity in outcome risk estimates between individual variants,73 we applied a number of sensitivity MR methods that better account for heterogeneity.74 These include an Egger analysis to identify and correct for unbalanced heterogeneity (‘horizontal pleiotropy’), indicated by a significant Egger intercept (p<0.05),75 and weighted median (WM) and penalised weighted median (PWM) models to correct for balanced heterogeneity.76 In addition, we introduced the radial method to exclude variants from each model in cases where they are recognized as outliers, as well as Steiger filtering to assess for potential reverse causality (i.e. variants with stronger association with the outcome than with the exposure).77 As previous work on IGF-1 showed a strong association with height, and to a lesser extent BMI,30 we also used multivariable MR analysis78 to estimate the direct effect of IGF-1 levels on T2D not mediated by BMI or height by adjusting for their effects as covariates using queried phenotype data for UKBB participants (Tables S4 and S5). In order to examine the individual level effect of IGF1 and IGF1R loci on T2D, BMI, childhood and adult height, we performed the variant-specific lookups as well as calculated the Wald ratio using the R package 'TwoSampleMR'.79 All results presented in the main text are expressed in standard deviations of IGF-1 levels (one S.D. is equivalent to ∼5.5 nmol/L in UKBB). Values available in Table S4 are raw data, per unit IGF-1.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

post Post a Question
0 Q&A