Members of the SEQC consortium used a diverse set of variant calling pipelines. We asked the bioinformaticians of each organization to develop their preferred pipelines, drawing on their best expertise, to call variants on their selected WES and WGS datasets. For the WES datasets, nine teams developed twenty-two pipelines, as shown in Additional file 1: Table S3. Each team selected the WES datasets with which it had the most experience. Each of the WES1–3 datasets was analyzed by seven to fourteen different pipelines on either reference genome version.
The freedom to choose datasets, reference genome versions, mappers, callers, parameters, and filters created diversity and produced marginally to significantly different results between variant calling pipelines on the same input data. We investigated the similarity of the pipeline-library combinations (PLCs) in terms of variant calling on the individual UHR cell lines. The results did not fall into simple patterns: many PLCs produced quite similar variant calls, while outlier pipelines were also detected.
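As an illustration of how PLC call sets on one cell line could be compared, the sketch below computes a pairwise Jaccard index. The text does not state which similarity measure was used, so the Jaccard index, the function names, and the representation of each PLC's calls as a Python set are all assumptions made for this example.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two variant sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pairwise_similarity(call_sets):
    """Map each unordered pair of PLC names to the Jaccard index of their calls.

    call_sets: dict of PLC name -> set of variant keys for one cell line
    (hypothetical input layout, not taken from the study).
    """
    return {
        (p, q): jaccard(call_sets[p], call_sets[q])
        for p, q in combinations(sorted(call_sets), 2)
    }
```

A PLC whose similarity to every other PLC is low would stand out as a candidate outlier under this measure.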
To achieve consensus, we defined a Class 1 positive variant as one called by at least half of the PLCs, with alternative allele frequency (VAF) no less than 10%, on the same cell line for each of WES1–3. The per-cell-line variant lists were then pooled across the cell lines, by kit, to generate a non-redundant list of variants for the pooled Sample A for each kit. We then took the intersection of the non-redundant variants called for each of WES1, WES2, and WES3 to compose the Class 1 list of variants, which are defined as known positives in this study. We also restricted the region in which Class 1 variants are defined: only variants called within the CTR were termed Class 1 known positives. We performed this procedure for hg19 and then conducted a liftover to map the variants to hg38 genome positions.
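The voting, pooling, and intersection steps described above can be sketched as follows. This is a minimal illustration rather than the consortium's implementation. It assumes that a PLC's vote counts only when its own reported VAF is at least 10%, that variants are keyed as (chrom, pos, ref, alt) tuples, and that the CTR is given as 1-based inclusive (chrom, start, end) intervals. The function names and the nested-dictionary input layout are hypothetical, and the liftover to hg38 is not shown.

```python
from collections import defaultdict

VAF_MIN = 0.10  # threshold from the text: VAF no less than 10%


def consensus_for_cell_line(plc_calls):
    """Variants called with VAF >= VAF_MIN by at least half of the PLCs.

    plc_calls: dict of PLC name -> {variant: vaf} for one cell line in one
    WES dataset. A PLC that ran but called nothing maps to an empty dict.
    """
    votes = defaultdict(int)
    for calls in plc_calls.values():
        for variant, vaf in calls.items():
            if vaf >= VAF_MIN:
                votes[variant] += 1
    n_plcs = len(plc_calls)
    return {v for v, n in votes.items() if 2 * n >= n_plcs}


def in_ctr(variant, ctr):
    """True if the variant position lies in any CTR interval (assumed 1-based, inclusive)."""
    chrom, pos = variant[0], variant[1]
    return any(c == chrom and start <= pos <= end for c, start, end in ctr)


def class1_variants(data, ctr):
    """Intersect the per-dataset pooled consensus lists, then restrict to the CTR.

    data: dict of dataset name -> cell line -> PLC name -> {variant: vaf}.
    """
    pooled_per_dataset = []
    for cell_lines in data.values():
        pooled = set()  # non-redundant list across cell lines
        for plc_calls in cell_lines.values():
            pooled |= consensus_for_cell_line(plc_calls)
        pooled_per_dataset.append(pooled)
    shared = set.intersection(*pooled_per_dataset) if pooled_per_dataset else set()
    return {v for v in shared if in_ctr(v, ctr)}
```

Using integer arithmetic for the "at least half" test (2 * n >= n_plcs) avoids rounding questions when the number of PLCs is odd.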
The Class 1 positives are not a complete list of variants for pooled Sample A. However, given that (i) the sequencing depth was large, (ii) only variants with VAF ≥ 10% were considered by cell line, (iii) the variants were selected by voting among multiple PLCs with diverse callers and also agreed among the three WES datasets, and (iv) a random sample of 114 of these variants (including 33 at VAF < 5%) was 100% orthogonally verified by ddPCR, we consider the Class 1 variants to be known positives.