Imputation of the UK Biobank to the TOPMed panel and association analyses

Daniel Taliun
Daniel N. Harris
Michael D. Kessler
Jedidiah Carlson
Zachary A. Szpiech
Raul Torres
Sarah A. Gagliano Taliun
André Corvelo
Stephanie M. Gogarten
Hyun Min Kang
Achilleas N. Pitsillides
Jonathon LeFaive
Seung-been Lee
Xiaowen Tian
Brian L. Browning
Sayantan Das
Anne-Katrin Emde
Wayne E. Clarke
Douglas P. Loesch
Amol C. Shetty
Thomas W. Blackwell
Albert V. Smith
Quenna Wong
Xiaoming Liu
Matthew P. Conomos
Dean M. Bobo
François Aguet
Christine Albert
Alvaro Alonso
Kristin G. Ardlie
Dan E. Arking
Stella Aslibekyan
Paul L. Auer
John Barnard
R. Graham Barr
Lucas Barwick
Lewis C. Becker
Rebecca L. Beer
Emelia J. Benjamin
Lawrence F. Bielak
John Blangero
Michael Boehnke
Donald W. Bowden
Jennifer A. Brody
Esteban G. Burchard
Brian E. Cade
James F. Casella
Brandon Chalazan
Daniel I. Chasman
Yii-Der Ida Chen

The UK Biobank genetic data were phased in 81 chromosomal chunks using Eagle v.2.4. The phased data were then converted from GRCh37 to GRCh38 coordinates using LiftOver [112], and imputation was performed using Minimac4 [111].

We assessed the correlation between genotypes in the exome-sequencing data released by the UK Biobank (following their SPB pipeline [113]) and the TOPMed-imputed genotypes. The comparison covered 49,819 individuals and 3,052,260 autosomal variants that were present in both the exome-sequencing and TOPMed-imputed datasets (matched by chromosome, position and alleles, and with an imputation quality of at least 0.3 in the TOPMed-imputed data). We assigned the variants to minor allele frequency (MAF) bins, using the MAF from the exome data to define the bins, and averaged the Pearson correlations within each bin.
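The binned comparison described above can be illustrated with a minimal sketch. It assumes the matched variants are already loaded as two NumPy arrays of equal shape (variants by individuals); the function name `binned_imputation_r`, the array layout and the bin edges are hypothetical choices for illustration and are not taken from the original analysis.

```python
import numpy as np

def binned_imputation_r(seq_genotypes, imputed_dosages, maf_bin_edges):
    """Average per-variant Pearson r within MAF bins (illustrative sketch).

    seq_genotypes   : (n_variants, n_samples) alternate-allele counts (0/1/2)
                      from sequencing; assumed input layout.
    imputed_dosages : same shape, imputed dosages in [0, 2].
    maf_bin_edges   : increasing bin edges, e.g. [0.0, 0.01, 0.05, 0.5].
    """
    seq = np.asarray(seq_genotypes, dtype=float)
    imp = np.asarray(imputed_dosages, dtype=float)
    edges = [float(e) for e in maf_bin_edges]

    # MAF is taken from the sequencing data, which defines the bins.
    af = seq.mean(axis=1) / 2.0
    maf = np.minimum(af, 1.0 - af)

    # Per-variant Pearson correlation between sequenced and imputed genotypes.
    seq_c = seq - seq.mean(axis=1, keepdims=True)
    imp_c = imp - imp.mean(axis=1, keepdims=True)
    denom = np.sqrt((seq_c ** 2).sum(axis=1) * (imp_c ** 2).sum(axis=1))
    with np.errstate(invalid="ignore", divide="ignore"):
        r = (seq_c * imp_c).sum(axis=1) / denom

    # Using only the inner edges keeps MAF = 0.5 inside the last bin.
    bin_index = np.digitize(maf, edges[1:-1])

    result = {}
    for b in range(len(edges) - 1):
        # Variants with no variation in either dataset have undefined r and are skipped.
        in_bin = (bin_index == b) & np.isfinite(r)
        if in_bin.any():
            result[(edges[b], edges[b + 1])] = float(r[in_bin].mean())
    return result
```

The returned dictionary maps each (lower edge, upper edge) pair to the mean correlation of the variants in that bin; bins without any usable variant are omitted.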

We tested single putative loss-of-function (pLOF), nonsense, frameshift and essential splice-site variants [85,86] for association with 1,419 PheCodes, which were constructed from composites of ICD-10 (International Classification of Diseases, 10th revision) codes to define cases and controls. Construction of the PheCodes has been previously described [114]. We restricted the association analysis to ‘white British’ individuals; 408,008 individuals remained after the following quality-control criteria were applied: (1) the participant had not withdrawn consent from the UK Biobank study as of the end of 2019; (2) ‘submitted gender’ matched ‘inferred sex’; (3) phased autosomal data were available; (4) not an outlier for the number of missing genotypes or heterozygosity; (5) no putative sex chromosome aneuploidy; (6) no excess of relatives; (7) not excluded from kinship inference; and (8) included in the UK Biobank-defined ‘white British’ ancestry subset. For the association analyses, we used the logistic mixed-model test implemented in SAIGE [114], with birth year and the top four principal components (computed from the white British subset) as covariates.

For the pLOF burden tests, we created a burden variable for each autosomal gene with at least two rare pLOF variants (n = 12,052 genes) by summing the dosages of rare pLOF variants for each individual. This sum of dosages was tested for association with the 1,419 traits using SAIGE, with the same covariates as in the single-variant tests. For both the single-variant and the burden tests, we used 5 × 10⁻⁸ as the genome-wide significance threshold.
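The construction of the burden variable can be sketched as follows. This is an illustrative sketch only: the function name `gene_burden`, the input layout and the precomputed `variant_is_rare` mask are assumptions, since the text does not state the frequency cut-off used to call a variant rare. The association test itself (run in SAIGE) is not reproduced here.

```python
import numpy as np

def gene_burden(dosages, variant_genes, variant_is_rare, min_variants=2):
    """Sum rare pLOF dosages per individual within each gene (illustrative sketch).

    dosages         : (n_variants, n_samples) dosages of pLOF variants.
    variant_genes   : gene label for each variant (length n_variants).
    variant_is_rare : boolean mask marking rare variants; the rarity
                      threshold is an assumption supplied by the caller.
    min_variants    : genes with fewer rare pLOF variants are skipped.
    """
    dosages = np.asarray(dosages, dtype=float)
    genes = np.asarray(variant_genes)
    rare = np.asarray(variant_is_rare, dtype=bool)

    burdens = {}
    for gene in np.unique(genes[rare]):
        rows = rare & (genes == gene)
        # Keep only genes with at least `min_variants` rare pLOF variants.
        if rows.sum() >= min_variants:
            # One burden value per individual: the sum of dosages across variants.
            burdens[str(gene)] = dosages[rows].sum(axis=0)
    return burdens
```

Each returned vector has one entry per individual and would be passed to the association software as the tested variable in place of a single-variant dosage.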
