VCF preprocessing

Wen-Wei Liao
Mobin Asri
Jana Ebler
Daniel Doerr
Marina Haukness
Glenn Hickey
Shuangjia Lu
Julian K. Lucas
Jean Monlong
Haley J. Abel
Silvia Buonaiuto
Xian H. Chang
Haoyu Cheng
Justin Chu
Vincenza Colonna
Jordan M. Eizenga
Xiaowen Feng
Christian Fischer
Robert S. Fulton
Shilpa Garg
Cristian Groza
Andrea Guarracino
William T. Harvey
Simon Heumos
Kerstin Howe
Miten Jain
Tsung-Yu Lu
Charles Markello
Fergal J. Martin
Matthew W. Mitchell
Katherine M. Munson
Moses Njagi Mwaniki
Adam M. Novak
Hugh E. Olsen
Trevor Pesout
David Porubsky
Pjotr Prins
Jonas A. Sibbesen
Jouni Sirén
Chad Tomlinson
Flavia Villani
Mitchell R. Vollger
Lucinda L. Antonacci-Fulton
Gunjan Baid
Carl A. Baker
Anastasiya Belyaeva
Konstantinos Billis
Andrew Carroll
Pi-Chuan Chang
Sarah Cody

We used a VCF file derived from snarl traversal of the MC graph as the basis for genotyping. The records in this VCF represent bubbles in the underlying pangenome graph and their nested variants, derived from the snarl tree. Each variant was annotated with its level in this tree; variants annotated with ‘LV=0’ correspond to top-level bubbles. We used vcfbub (v.0.1.0; ref. 100) with parameters -l 0 and -r 100000 to filter the VCF. This removed all non-top-level bubbles from the VCF unless they were nested inside a top-level bubble with a reference length exceeding 100 kb; that is, top-level bubbles longer than 100 kb were replaced by their child nodes in the snarl tree. The VCF also contained the haplotypes of all 44 assembly samples, which represent paths in the pangenome graph. We additionally removed all records for which more than 20% of the 88 haplotypes carried a missing allele (‘.’). This resulted in a set of 22,133,782 bubbles. Next, we used PanGenie (v.1.0.0; ref. 54) to genotype these bubbles across all 3,202 samples from the 1KG, based on high-coverage Illumina reads (ref. 19).
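
The missing-allele filter described above can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `missing_fraction` and `filter_records` are hypothetical, the input is assumed to be uncompressed, tab-separated VCF text with GT as the first FORMAT key, and the denominator is taken to be the number of haplotype alleles present in each record rather than a fixed 88.

```python
import re

# Threshold from the text: drop a record if MORE than 20% of haplotypes are missing.
MAX_MISSING_FRACTION = 0.2


def missing_fraction(vcf_line: str) -> float:
    """Return the fraction of haplotype alleles that are '.' in one VCF record."""
    fields = vcf_line.rstrip("\n").split("\t")
    alleles = []
    # Columns 10 onward hold per-sample data (columns 1-9 are fixed VCF fields).
    for sample in fields[9:]:
        genotype = sample.split(":")[0]  # assumption: GT is the first FORMAT key
        alleles.extend(re.split(r"[|/]", genotype))  # phased or unphased separators
    if not alleles:
        return 0.0
    return sum(allele == "." for allele in alleles) / len(alleles)


def filter_records(lines):
    """Yield header lines unchanged and records at or below the missing threshold."""
    for line in lines:
        if line.startswith("#") or missing_fraction(line) <= MAX_MISSING_FRACTION:
            yield line
```

With 88 haplotypes, 20% corresponds to 17.6 alleles, so under this reading a record is removed once 18 or more haplotypes carry a missing allele. A record with exactly 20% missing alleles is retained, because the text specifies "more than 20%".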
