Rules for determining Class 1 positive variants (by each genome version)

Wendell Jones
Binsheng Gong
Natalia Novoradovskaya
Dan Li
Rebecca Kusko
Todd A. Richmond
Donald J. Johann, Jr
Halil Bisgin
Sayed Mohammad Ebrahim Sahraeian
Pierre R. Bushel
Mehdi Pirooznia
Katherine Wilkins
Marco Chierici
Wenjun Bao
Lee Scott Basehore
Anne Bergstrom Lucas
Daniel Burgess
Daniel J. Butler
Simon Cawley
Chia-Jung Chang
Guangchun Chen
Tao Chen
Yun-Ching Chen
Daniel J. Craig
Angela del Pozo
Jonathan Foox
Margherita Francescatto
Yutao Fu
Cesare Furlanello
Kristina Giorda
Kira P. Grist
Meijian Guan
Yingyi Hao
Scott Happe
Gunjan Hariani
Nathan Haseley
Jeff Jasper
Giuseppe Jurman
David Philip Kreil
Paweł Łabaj
Kevin Lai
Jianying Li
Quan-Zhen Li
Yulong Li
Zhiguang Li
Zhichao Liu
Mario Solís López
Kelci Miclaus
Raymond Miller
Vinay K. Mittal

Members of the SEQC consortium used a diversity of variant calling pipelines. We asked the bioinformaticians of each organization to develop their preferred pipelines, using their best expertise, to call variants on their selected WES and WGS datasets. For the WES datasets, twenty-two pipelines were developed by nine teams, as shown in Additional file 1: Table S3. Each team selected the WES datasets with which it had the most experience. Each of the WES1–3 datasets was analyzed by seven to fourteen different pipelines on either reference genome version.

The freedom to choose datasets, reference genome versions, mappers, callers, parameters, and filters created diversity and produced marginally to significantly different results between variant calling pipelines on the same input data. We investigated the similarity of the pipeline-library combinations (PLCs) in terms of variant calling on the individual UHR cell lines. The results did not fall into simple patterns: many PLCs produced quite similar variant calls, but outlier pipelines were also detected.
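The text does not state which similarity measure was used to compare PLCs. As a minimal sketch, assuming each PLC's calls on one cell line are available as a set of variant keys, the Jaccard index (an illustrative choice, not necessarily the authors' method) gives one pairwise comparison. The function names and the input layout are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two variant sets: |A & B| / |A | B|."""
    union = a | b
    # Two empty call sets are treated as identical (an assumption).
    return len(a & b) / len(union) if union else 1.0


def pairwise_similarity(calls_by_plc):
    """Jaccard index for every unordered pair of PLCs.

    calls_by_plc: dict mapping a PLC name to the set of variants it
    called on one cell line (hypothetical input layout).
    """
    return {
        (p, q): jaccard(calls_by_plc[p], calls_by_plc[q])
        for p, q in combinations(sorted(calls_by_plc), 2)
    }
```

A PLC whose scores against most other PLCs are consistently low would be a candidate outlier under this measure.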

To achieve consensus, we defined a Class 1 positive variant as one called by at least half of the PLCs with a variant allele frequency (VAF) of no less than 10% on the same cell line, for each of WES1–3. The variant lists for the individual cell lines were then pooled across cell lines, by kit, to generate a non-redundant list of variants for the pooled Sample A for each kit. We then took the intersection of the non-redundant variant lists from WES1, WES2, and WES3 to compose the Class 1 list of variants, which are defined as known positives in this study. We also considered the region over which Class 1 variants would be defined: only variants called within the CTR were termed Class 1 known positives. We performed this procedure for hg19 and then conducted a liftover to map the variants to hg38 genome positions.
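The selection rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the consortium's actual code: the nested-dictionary input layout, the function names, the `(chrom, pos, ref, alt)` variant key, and the 1-based inclusive interval representation of the CTR are all hypothetical, and "at least half" is read as votes ≥ half the number of PLCs that analyzed the cell line. The liftover to hg38 is not covered.

```python
from collections import defaultdict

VAF_MIN = 0.10  # per-cell-line VAF threshold stated in the text


def cell_line_consensus(plc_calls, vaf_min=VAF_MIN):
    """Variants called at VAF >= vaf_min by at least half of the PLCs.

    plc_calls: dict PLC name -> {variant: VAF} for one cell line in one
    WES dataset (hypothetical layout).
    """
    votes = defaultdict(int)
    for calls in plc_calls.values():
        for variant, vaf in calls.items():
            if vaf >= vaf_min:
                votes[variant] += 1
    n_plcs = len(plc_calls)
    return {v for v, count in votes.items() if 2 * count >= n_plcs}


def dataset_variants(calls_by_cell_line):
    """Pool per-cell-line consensus calls into one non-redundant set."""
    pooled = set()
    for plc_calls in calls_by_cell_line.values():
        pooled |= cell_line_consensus(plc_calls)
    return pooled


def in_ctr(variant, ctr):
    """True if the variant position lies in any CTR interval.

    ctr: list of (chrom, start, end), assumed 1-based and inclusive.
    """
    chrom, pos = variant[0], variant[1]
    return any(c == chrom and start <= pos <= end for c, start, end in ctr)


def class1_positives(datasets, ctr):
    """Intersect the pooled lists of all datasets, then restrict to the CTR.

    datasets: dict dataset name (e.g. "WES1") -> cell line -> PLC -> calls.
    """
    per_dataset = [dataset_variants(d) for d in datasets.values()]
    shared = set.intersection(*per_dataset) if per_dataset else set()
    return {v for v in shared if in_ctr(v, ctr)}
```

A linear scan over the CTR intervals keeps the sketch short; a real target file with many intervals would call for an interval tree or a sorted lookup.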

The Class 1 positives are not a complete list of variants for pooled Sample A. However, we consider the Class 1 variants to be known positives because (i) the sequencing depth was large, (ii) only variants with VAF ≥ 10% were considered by cell line, (iii) the variants were selected by voting among multiple PLCs with a diversity of callers and also agreed across the three WES datasets, and (iv) a random sample of 114 of these variants (including 33 at VAF < 5%) was 100% orthogonally verified by ddPCR.
