After phasing the UK Biobank genetic data (carried out on 81 chromosomal chunks using Eagle v.2.4), the phased data were converted from GRCh37 to GRCh38 using LiftOver (ref. 112). Imputation was performed using Minimac4 (ref. 111).
We computed the correlation between genotypes in the exome-sequencing data released by the UK Biobank (following their SPB pipeline; ref. 113) and the TOPMed-imputed genotypes. The comparison covered 49,819 individuals and 3,052,260 autosomal variants that were present in both the exome-sequencing and TOPMed-imputed datasets (matched by chromosome, position and alleles, and with an imputation quality of at least 0.3 in the TOPMed-imputed data). We split the variants into minor allele frequency (MAF) bins, defined using the MAF from the exome data, and averaged the per-variant Pearson correlations within each bin.
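The binning step above can be sketched as follows. This is a minimal illustration on small in-memory inputs, not the pipeline used for the analysis. The function names, the dictionary layout keyed by (chromosome, position, reference allele, alternate allele) and the bin edges are assumptions made for the example, and it averages the correlation itself; the text does not specify whether r or r² was averaged.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; None if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a monomorphic variant
    return sxy / sqrt(sxx * syy)


def maf_from_genotypes(genotypes):
    """Minor allele frequency from diploid genotypes coded 0, 1 or 2."""
    allele_freq = sum(genotypes) / (2 * len(genotypes))
    return min(allele_freq, 1 - allele_freq)


def mean_correlation_by_maf_bin(exome, imputed, bin_edges):
    """Average per-variant Pearson r within MAF bins defined on the exome data.

    exome, imputed: dicts mapping a (chrom, pos, ref, alt) key to a list of
    per-individual genotypes or dosages, with individuals in the same order.
    bin_edges: increasing MAF boundaries (hypothetical values, e.g. [0, 0.01, 0.5]).
    Returns one mean per bin, or None for a bin with no usable variant.
    """
    n_bins = len(bin_edges) - 1
    per_bin = [[] for _ in range(n_bins)]
    # Only variants present in both datasets are compared.
    for key in exome.keys() & imputed.keys():
        r = pearson(exome[key], imputed[key])
        if r is None:
            continue
        maf = maf_from_genotypes(exome[key])
        for i in range(n_bins):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            is_last = i == n_bins - 1
            # Bins are half-open, except the last, which includes its upper edge.
            if lo <= maf < hi or (is_last and maf == hi):
                per_bin[i].append(r)
                break
    return [sum(rs) / len(rs) if rs else None for rs in per_bin]
```

At the scale reported here (about 3 million variants and 50,000 individuals), a vectorized or streaming implementation would be needed; the sketch only makes the order of operations explicit.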
We tested single pLOF, nonsense, frameshift and essential splice-site variants (refs. 85,86) for association with 1,419 PheCodes constructed from composites of ICD-10 (International Classification of Diseases 10th revision) codes to define cases and controls. Construction of the PheCodes has been previously described (ref. 114). We performed the association analysis in the ‘white British’ individuals, which left 408,008 individuals after the following quality control criteria were applied: (1) the participant had not withdrawn consent from the UK Biobank study as of the end of 2019; (2) ‘submitted gender’ matched ‘inferred sex’; (3) phased autosomal data were available; (4) not an outlier for the number of missing genotypes or heterozygosity; (5) no putative sex chromosome aneuploidy; (6) no excess of relatives; (7) not excluded from kinship inference; and (8) included in the UK Biobank-defined ‘white British’ ancestry subset. To perform the association analyses, we used a logistic mixed model test implemented in SAIGE (ref. 114) with birth year and the top four principal components (computed from the white British subset) as covariates. For the pLOF burden tests, for each autosomal gene with at least two rare pLOF variants (n = 12,052 genes), a burden variable was created by summing the dosages of rare pLOF variants for each individual. This sum of dosages was tested for association with the 1,419 traits using SAIGE, with the same covariates as in the single-variant tests. For both the single-variant and the burden tests, we used 5 × 10⁻⁸ as the genome-wide significance threshold.
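The construction of the burden variable can be sketched as below. This is a hedged illustration only: the function name, the input dictionaries and the variant and gene identifiers are hypothetical. The input is assumed to be already restricted to rare pLOF variants, because the frequency cut-off for 'rare' is not given in this passage. The association test itself (run in SAIGE) is not reproduced.

```python
from collections import defaultdict


def gene_burden_scores(dosages, variant_to_gene, min_variants=2):
    """Sum per-individual dosages of rare pLOF variants within each gene.

    dosages: dict mapping variant ID to a list of per-individual dosages
        (assumed to contain rare pLOF variants only, individuals in the same order).
    variant_to_gene: dict mapping variant ID to gene name.
    min_variants: genes with fewer variants than this are skipped
        (the text requires at least two rare pLOF variants per gene).
    Returns a dict mapping gene name to a list of per-individual burden values.
    """
    variants_by_gene = defaultdict(list)
    for variant, gene in variant_to_gene.items():
        if variant in dosages:
            variants_by_gene[gene].append(variant)

    burden = {}
    for gene, variants in variants_by_gene.items():
        if len(variants) < min_variants:
            continue
        n_individuals = len(dosages[variants[0]])
        # Burden for individual i is the sum of that individual's dosages
        # across all qualifying variants in the gene.
        burden[gene] = [
            sum(dosages[v][i] for v in variants) for i in range(n_individuals)
        ]
    return burden
```

Each resulting per-gene vector would then take the place of a single-variant genotype in the association model, alongside the same covariates.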