Quality control for WGS in MVP

Satoshi Koyama
Zhi Yu
Seung Hoan Choi
Sean J Jurgens
Margaret Sunitha Selvaraj
Derek Klarin
Jennifer E Huffman
Shoa L Clarke
Michael N Trinh
Akshaya Ravi
Jacqueline S Dron
Catherine Spinks
Ida Surakka
Aarushi Bhatnagar
Kim Lannery
Whitney Hornsby
Scott M Damrauer
Kyong-Mi Chang
Julie A Lynch
Themistocles L Assimes
Philip S Tsao
Daniel J Rader
Kelly Cho
Gina M Peloso
Patrick T Ellinor

To confirm the accuracy, sensitivity, and specificity of the imputation, we used the initial release of whole-genome sequencing (WGS) data from the MVP study. These data were collected and sequenced with a focus on elucidating the pathophysiology of COVID-19 infection from participants' genomes. Sequencing was performed using Illumina's Sequencing by Synthesis technology to a targeted depth of 30x. Individual variant calling for 10,413 samples was performed on the cloud-based data and task management framework Trellis (ref. 59). In summary, reads were aligned to the GRCh38 reference genome with BWA-MEM (version 0.7.15), and variants were called with the HaplotypeCaller tool in GATK 4.1.0.0. Genotypes of all samples were aggregated into a matrix table using the gVCF combiner implemented in Hail (ref. 55) for additional quality-control steps. Briefly, we retained high-quality genotypes by applying the following steps:

I. Variants in low-complexity regions and ENCODE blacklist regions were removed.
II. Variants within regions of atypical sequencing depth (DP < 10 or DP > 400) were discarded. For haploid genotypes on sex chromosomes, a minimum DP > 5 was required.
III. Genotypes were retained if they were:
a. Homozygous reference with Genotype Quality > 20, or
b. Homozygous alternate with a Phred-scaled likelihood of the homozygous-reference genotype (PL[0]) > 20 and a ratio of alternate-allele depth (DPALT) to total depth at the site (DPSITE), DPALT/DPSITE, > 0.9, or
c. Heterozygous with PL[0] > 20, a ratio of the sum of DPALT and reference-allele depth (DPREF) to DPSITE, (DPALT + DPREF)/DPSITE, > 0.9, and DPALT/DPSITE > 0.2.
IV. Variants with a high missing rate (> 0.8) and variants with a population-wide Hardy-Weinberg equilibrium P value ≤ 1 × 10⁻⁵ (for variants with minor allele frequency (MAF) ≥ 1%) or ≤ 1 × 10⁻⁶ (for variants with MAF < 1%) were discarded.
V. Samples with a low call rate (≤ 0.97) or low overall sequencing coverage (mean depth ≤ 18) were excluded.

This processing resulted in 187,790,701 variants in 10,390 individuals.
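As an illustration only, the depth, genotype, variant, and sample thresholds listed above can be written as plain Python predicates. This is a minimal sketch, not the Hail pipeline used in the study: the `Genotype` class, its field names, and the three function names are hypothetical. It also assumes that the haploid rule (DP > 5) replaces only the lower depth bound, that the DP > 400 ceiling still applies to haploid genotypes, and that a variant is discarded if it fails either the missing-rate or the Hardy-Weinberg criterion.

```python
from dataclasses import dataclass


@dataclass
class Genotype:
    """Hypothetical container for the per-genotype fields used by the filters."""
    kind: str              # "hom_ref", "het", or "hom_alt"
    dp_site: int           # total depth at the site (DPSITE)
    dp_ref: int            # reads supporting the reference allele (DPREF)
    dp_alt: int            # reads supporting the alternate allele (DPALT)
    gq: int                # Genotype Quality
    pl_ref: int            # PL[0], Phred-scaled likelihood of hom-ref
    haploid: bool = False  # haploid call on a sex chromosome


def genotype_passes(g: Genotype) -> bool:
    """Depth window plus the genotype-specific criteria a, b, and c."""
    # Depth window: DP < 10 or DP > 400 fails; haploid calls need DP > 5.
    # Assumption: the upper bound also applies to haploid genotypes.
    lower_ok = g.dp_site > 5 if g.haploid else g.dp_site >= 10
    if not lower_ok or g.dp_site > 400:
        return False
    alt_fraction = g.dp_alt / g.dp_site
    if g.kind == "hom_ref":
        return g.gq > 20
    if g.kind == "hom_alt":
        return g.pl_ref > 20 and alt_fraction > 0.9
    if g.kind == "het":
        informative = (g.dp_alt + g.dp_ref) / g.dp_site
        return g.pl_ref > 20 and informative > 0.9 and alt_fraction > 0.2
    return False  # unknown genotype class


def variant_passes(missing_rate: float, maf: float, hwe_p: float) -> bool:
    """Variant-level filter: missing rate and MAF-dependent HWE threshold."""
    hwe_cutoff = 1e-5 if maf >= 0.01 else 1e-6
    # Assumption: failing either criterion discards the variant.
    return missing_rate <= 0.8 and hwe_p > hwe_cutoff


def sample_passes(call_rate: float, mean_depth: float) -> bool:
    """Sample-level filter: call rate > 0.97 and mean depth > 18."""
    return call_rate > 0.97 and mean_depth > 18
```

For example, a heterozygous call with DPSITE = 30, DPREF = 15, DPALT = 14, and PL[0] = 300 passes, because (14 + 15)/30 ≈ 0.97 and 14/30 ≈ 0.47. The same site with DPREF = 26 and DPALT = 4 fails, because the alternate-allele fraction is only about 0.13.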
