Somatic variant calling, filtering, and MAF generation

Zhenyu Zhang
Kyle Hernandez
Jeremiah Savage
Shenglai Li
Dan Miller
Stuti Agrawal
Francisco Ortuno
Louis M. Staudt
Allison Heath
Robert L. Grossman

The initial GDC release includes four somatic variant callers: MuTect2, VarScan2, MuSE, and SomaticSniper.

MuTect2 combines the local de novo assembly of HaplotypeCaller with the somatic genotyping engine of MuTect, which applies a Bayesian classifier to detect somatic mutations [11]. The GDC uses the MuTect2 tools from GATK version nightly-2016-02-25-gf39d340. Before tumor-normal pairs can be used for somatic variant calling, it is important to generate a Panel of Normals (PoN) filter that contains calling artifacts and potential germline variants. As mentioned previously, whole genome amplified (WGA) samples are analyzed with dontUseSoftClippedBases turned on.

VarScan2 is another somatic variant caller that identifies both SNVs and indels. It uses heuristics and statistics to identify variants and considers the confounding impacts of read depth, base quality, variant allele frequency, and statistical significance [13]. The GDC uses VarScan2 version 2.3.9. The first step of VarScan2 calling is to generate a single mpileup file from both the tumor and normal BAMs using samtools. We set the samtools quality cutoff to 1 and disabled Base Alignment Quality (BAQ) score computation. The mpileup is then used as input to VarScan Somatic to generate a VCF file that contains both SNP and indel calls. The resulting VCF is filtered for significant calls using VarScan ProcessSomatic.
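The three VarScan2 steps above can be sketched as command lines. This is a minimal illustration, not the GDC production code: the file names and jar path are placeholders, reading the "quality cutoff" as the samtools minimum mapping quality option (-q) is an assumption, and the VarScan options shown (--mpileup 1, --output-vcf 1) and the name of the file passed to processSomatic are chosen for illustration because the full GDC parameter set is not given here.

```python
def varscan2_commands(reference, normal_bam, tumor_bam, prefix,
                      varscan_jar="VarScan.jar"):
    """Return argv lists for the three VarScan2 steps (nothing is executed)."""
    # Step 1: one mpileup covering both BAMs, normal first, then tumor.
    # -q 1 : quality cutoff of 1 (read here as minimum mapping quality; assumption)
    # -B   : disable Base Alignment Quality computation
    # samtools writes the pileup to stdout; redirect it to f"{prefix}.mpileup".
    mpileup = ["samtools", "mpileup", "-f", reference, "-q", "1", "-B",
               normal_bam, tumor_bam]

    # Step 2: somatic calling on the combined pileup, requesting VCF output.
    somatic = ["java", "-jar", varscan_jar, "somatic",
               f"{prefix}.mpileup", prefix,
               "--mpileup", "1", "--output-vcf", "1"]

    # Step 3: keep significant calls. The input file name is an assumption.
    process = ["java", "-jar", varscan_jar, "processSomatic",
               f"{prefix}.snp.vcf"]

    return [mpileup, somatic, process]


if __name__ == "__main__":
    for argv in varscan2_commands("ref.fa", "normal.bam", "tumor.bam", "pair1"):
        print(" ".join(argv))
```

Returning argument lists instead of running the tools keeps the sketch side-effect free; each list could be handed to subprocess.run once the paths are real.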

MuSE calls somatic variants using the Markov Substitution model for Evolution [12]. The first step, “MuSE call”, estimates the equilibrium frequencies of all four alleles and reports the maximum a posteriori estimate at every genomic locus. The second step, “MuSE sump”, applies a tier-based cutoff derived from a sample-specific error model, which also takes dbSNP information into account. The GDC uses MuSE version 1.0rc_submission_c039ffa. The first step of MuSE can be parallelized over genomic chunks, which gives a close-to-linear speedup. The GDC currently passes only calls with the quality filter “PASS” to the GDC public MAF files; however, variants with other quality Tier values could also be considered at a user’s discretion.
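The chunk-based parallelization of the first MuSE step can be illustrated by how a genome might be split into regions. This is only a sketch: the function name, the chunk size, and the contig lengths are hypothetical, and how each region is handed to “MuSE call” and how the per-chunk outputs are merged are not described in this protocol.

```python
def genomic_chunks(contig_lengths, chunk_size):
    """Yield (contig, start, end) tuples with 1-based, inclusive coordinates.

    contig_lengths maps contig name to length in bases; the last chunk of
    each contig is truncated at the contig end.
    """
    for contig, length in contig_lengths.items():
        for start in range(1, length + 1, chunk_size):
            end = min(start + chunk_size - 1, length)
            yield contig, start, end


if __name__ == "__main__":
    # Toy lengths for illustration only, not real chromosome sizes.
    toy_genome = {"chr1": 250, "chr2": 100}
    for contig, start, end in genomic_chunks(toy_genome, 100):
        # Each region could be processed by an independent worker.
        print(f"{contig}:{start}-{end}")
```

Because the chunks do not overlap and together cover every base once, the per-chunk jobs are independent, which is what makes the near-linear speedup plausible.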

SomaticSniper is a somatic variant caller that identifies only SNPs. It uses Bayesian inference to compare genotype likelihoods between tumor and normal samples [14]. The GDC uses the default parameter settings of SomaticSniper version 1.0.5.0.

In addition to the built-in filters in each somatic caller, the GDC also applies additional filtering tools to label caller-generated variants. Because these filters are frequently updated, we have highlighted only a few of the major steps below.

False Positive Filter (FPFilter, https://github.com/ucscCancer/fpfilter-tool) is applied to both VarScan2 and SomaticSniper VCFs.

SomaticSniper variants with a somatic score (SSC) below 25 are removed from annotated VCFs. This is the only step in the entire GDC somatic variant pipeline in which low-quality variants are removed rather than tagged.
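The SSC cutoff described above amounts to a simple record filter. The following is a minimal sketch, not the GDC implementation: it assumes SSC is an integer per-sample FORMAT field, that every record carries it, and that the tumor sample is the last column of the VCF; the function name is hypothetical.

```python
def filter_by_ssc(vcf_lines, min_ssc=25, tumor_column=-1):
    """Return header lines plus records whose tumor SSC is >= min_ssc."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        keys = fields[8].split(":")               # FORMAT column
        values = fields[tumor_column].split(":")  # tumor sample (assumed last)
        ssc = int(dict(zip(keys, values))["SSC"])
        if ssc >= min_ssc:  # records with SSC < 25 are dropped, not tagged
            kept.append(line)
    return kept


if __name__ == "__main__":
    example = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNORMAL\tTUMOR\n",
        "chr1\t100\t.\tA\tT\t.\t.\t.\tGT:SSC\t0/0:0\t0/1:40\n",
        "chr1\t200\t.\tG\tC\t.\t.\t.\tGT:SSC\t0/0:0\t0/1:12\n",
    ]
    for record in filter_by_ssc(example):
        print(record, end="")
```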

A WXS Panel of Normals was generated internally by running MuTect2 in artifact detection mode on about 5,000 TCGA normal WXS samples and combining the results with GATK CombineVariants. The sample list came from the TCGA Genomics Data Analysis Center (GDAC) and consists of TCGA normal samples that had been identified as free of hematopoiesis events (unpublished) at the time of GDC data processing. This PoN is used not only as a MuTect2 built-in filter [44], but is also applied to the outputs of the other three somatic callers in a similar manner.

d-ToxoG (http://archive.broadinstitute.org/cancer/cga/dtoxog) is used to remove oxoG artifacts from point mutation calls. These artifacts arise from oxidative DNA damage during sample preparation [45].

DKFZ Strandbias Filter (https://github.com/eilslabs/DKFZBiasFilter) is used to tag variants whose supporting reads show a significant bias toward one strand direction compared to the other.

Mutation Annotation Format (MAF) files are tab-delimited text files that aggregate mutation information from VCF files; they are generated at the project level. The GDC currently produces two types of MAF files: controlled-access MAFs, which contain all variants in the VCFs, and open-access somatic MAFs, which contain filtered variants with reduced germline contamination and are thus considered “high quality”. Any user can explore the open-access somatic MAF for high-quality calls, while a more sophisticated user may want to apply for dbGaP access to obtain the superset of mutations in the controlled-access MAF. With the larger set of mutations, they may perform custom filtering based on the FILTER and GDC_FILTER columns, or collect information that was removed from the open-access version, such as supporting read depth in the normal samples.
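Custom filtering on the FILTER and GDC_FILTER columns could look like the sketch below. It is an illustration under stated assumptions: comment lines are assumed to start with "#", multiple tags in one cell are assumed to be separated by semicolons, and the function names and the tag names in the example are illustrative rather than a list of the tags the GDC actually emits.

```python
import csv


def read_maf(lines):
    """Parse tab-delimited MAF lines into dicts, skipping '#' comment lines."""
    data_lines = (line for line in lines if not line.startswith("#"))
    return list(csv.DictReader(data_lines, delimiter="\t",
                               quoting=csv.QUOTE_NONE))


def drop_tagged(rows, excluded_tags):
    """Keep rows carrying none of excluded_tags in FILTER or GDC_FILTER."""
    kept = []
    for row in rows:
        # Assumption: multiple tags in a cell are joined with ';'.
        tags = set(row["FILTER"].split(";")) | set(row["GDC_FILTER"].split(";"))
        if not tags & set(excluded_tags):
            kept.append(row)
    return kept


if __name__ == "__main__":
    # Tiny made-up MAF with only the columns this sketch needs.
    maf_lines = [
        "#version example\n",
        "Hugo_Symbol\tFILTER\tGDC_FILTER\n",
        "GENE_A\tPASS\t\n",
        "GENE_B\texample_tag\t\n",
    ]
    for row in drop_tagged(read_maf(maf_lines), {"example_tag"}):
        print(row["Hugo_Symbol"])
```

Passing the excluded tags as an argument, rather than hard-coding them, lets a user tighten or relax the filter without touching the parsing code.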

The specification of the GDC MAF can be found at https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/.
