Single base substitution calling

Kenichi Yoshida
Kate HC Gowers
Henry Lee-Six
Deepak P Chandrasekharan
Tim Coorens
Elizabeth F Maughan
Kathryn Beal
Andrew Menzies
Fraser R Millar
Elizabeth Anderson
Sarah E Clarke
Adam Pennycuick
Ricky M Thakrar
Colin R Butler
Nobuyuki Kakiuchi
Tomonori Hirano
Robert E Hynds
Michael R Stratton
Inigo Martincorena
Sam M Janes
Peter J Campbell

Single base substitutions (SBSs) were called using the Cancer Variants through Expectation Maximisation (CaVEMan) algorithm [40] with copy number options of major copy number 5, minor copy number 2 and normal contamination 0.1. To allow the discovery of early embryonic mutations, we ran CaVEMan using an unmatched normal control. In addition to the default “PASS” filter, we removed variants with a median alignment score (ASMD) <120 and those with a clipping index (CLPM) >0 to remove mapping artefacts. Variants identified in the mouse feeder fibroblast DNA sample were also removed if they persisted in the call-set. Subsequently, for every mutation identified in any colony from a given patient, we counted the number of mutant and wild-type reads in all bronchial samples from the same patient using the bam2R function of the R package deepSNV [41], counting only bases with base quality ≥30 on reads with mapping quality ≥30. Further filters, described below, were applied to identify true somatic mutations and separate them from germline variants and recurrent sequencing errors.
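The mapping-artefact and feeder filters described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: it assumes each call has already been parsed into a dictionary, and the key names (`filter`, `ASMD`, `CLPM`, `chrom`, `pos`, `ref`, `alt`) and both function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

ASMD_MIN = 120  # variants with median alignment score below this are removed (from the text)
CLPM_MAX = 0    # variants with clipping index above this are removed (from the text)

Site = Tuple[str, int, str, str]  # (chrom, pos, ref, alt); hypothetical layout


def passes_mapping_filters(variant: Dict) -> bool:
    """Return True if a call survives the PASS, ASMD and CLPM filters."""
    return (
        variant["filter"] == "PASS"
        and variant["ASMD"] >= ASMD_MIN
        and variant["CLPM"] <= CLPM_MAX
    )


def remove_feeder_variants(variants: Iterable[Dict], feeder_sites: Set[Site]) -> List[Dict]:
    """Drop calls that were also seen in the mouse feeder fibroblast sample."""
    return [
        v for v in variants
        if (v["chrom"], v["pos"], v["ref"], v["alt"]) not in feeder_sites
    ]


def filter_calls(variants: Iterable[Dict], feeder_sites: Set[Site]) -> List[Dict]:
    """Apply the mapping filters first, then the feeder filter."""
    kept = [v for v in variants if passes_mapping_filters(v)]
    return remove_feeder_variants(kept, feeder_sites)
```

Keeping the thresholds as named constants makes it easy to see which values come from the text and which parts of the sketch are assumed.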

We fitted a binomial distribution to the total variant counts and total depth at each SBS site across all samples from one patient. To differentiate somatic variants from germline variants, we used a one-sided exact binomial test, with the null hypothesis that these variants were drawn from a binomial distribution with a success probability of 0.5 (0.95 for sex chromosomes in males). The alternative hypothesis was that these variants were drawn from distributions with lower success probabilities. Variants with p-value >10^-10 were considered germline variants.
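A minimal sketch of this germline test, using only the Python standard library, is given below. The function names and the `male_sex_chrom` flag are illustrative assumptions; the null probabilities (0.5 and 0.95) and the 10^-10 threshold are taken from the text. The one-sided p-value is the lower tail P(X ≤ k) under the null.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k < 0:
        return 0.0
    if k >= n:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * log_p + (n - i) * log_q
        for i in range(k + 1)
    ]
    m = max(log_terms)
    return min(1.0, math.exp(m) * sum(math.exp(t - m) for t in log_terms))


def is_germline(total_mutant: int, total_depth: int,
                male_sex_chrom: bool = False,
                threshold: float = 1e-10) -> bool:
    """Classify a site as germline if the one-sided p-value exceeds the threshold.

    total_mutant and total_depth are summed over all samples from one patient.
    """
    p_null = 0.95 if male_sex_chrom else 0.5
    p_value = binom_cdf(total_mutant, total_depth, p_null)
    return p_value > threshold
```

For example, a site with 500 mutant reads out of 1,000 in total is compatible with a heterozygous germline variant, whereas 30 mutant reads out of 1,000 gives a p-value far below the threshold and is retained as a candidate somatic variant.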

We fitted a beta-binomial distribution to the variant counts and depths of all SBSs across samples from the same patient for the remaining somatic variants. The beta-binomial was used as it captures the difference between artefactual variant sites and true somatic variants. Many artefacts appear to be randomly distributed across samples and can be modelled as drawn from a binomial distribution. True somatic variants will be present at high VAF in some samples, but absent in others, and are hence best captured by a highly over-dispersed beta-binomial. For each variant site, the maximum likelihood estimate of the over-dispersion factor (ρ) was calculated using a grid-based method (ranging from a value of 10^-6 to 10^-0.05). Variants with ρ<0.1 were filtered out and considered to be artefactual. The code for this filter is based on the Shearwater variant caller [41].
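The grid-based estimate of ρ can be illustrated with the sketch below. It is not the authors' code: it assumes the beta-binomial is parametrised by a mean μ (fixed here at the pooled VAF across samples) and an over-dispersion ρ, with α = μ(1−ρ)/ρ and β = (1−μ)(1−ρ)/ρ, and it assumes a grid step of 0.05 in the exponent. Only the grid limits (10^-6 to 10^-0.05) come from the text.

```python
import math
from typing import Sequence


def log_betabinom_pmf(x: int, n: int, mu: float, rho: float) -> float:
    """Log pmf of a beta-binomial with mean mu and over-dispersion rho."""
    a = mu * (1 - rho) / rho
    b = (1 - mu) * (1 - rho) / rho
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    log_beta_num = math.lgamma(x + a) + math.lgamma(n - x + b) - math.lgamma(n + a + b)
    log_beta_den = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return log_choose + log_beta_num - log_beta_den


def estimate_rho(mutant_counts: Sequence[int], depths: Sequence[int],
                 step: float = 0.05) -> float:
    """Grid-search maximum likelihood estimate of rho for one variant site."""
    total_mutant, total_depth = sum(mutant_counts), sum(depths)
    if not 0 < total_mutant < total_depth:
        raise ValueError("pooled VAF must lie strictly between 0 and 1")
    mu = total_mutant / total_depth  # assumed: mean fixed at the pooled VAF
    n_steps = int(round((6 - 0.05) / step))
    grid = [10 ** (-6 + i * step) for i in range(n_steps + 1)]  # 1e-6 .. 10^-0.05

    def log_likelihood(rho: float) -> float:
        return sum(log_betabinom_pmf(x, n, mu, rho)
                   for x, n in zip(mutant_counts, depths))

    return max(grid, key=log_likelihood)
```

A site with a few mutant reads in every sample yields an estimate near the bottom of the grid, while a site at roughly 50% VAF in some samples and absent from the others yields a much larger estimate. The 0.1 cut-off described above is then applied to the returned value.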

We observed peaks of lower VAFs in a subset of samples (Extended Figure 2C), suggesting the existence of mutations arising during the in vitro expansion of the single cell. These peaks were more prominent in samples from children. This suggests that the number of such mutations is relatively small, so that they stand out mainly in samples with a low true mutation burden, such as those from children. We discarded mutations with median VAF ≤0.3 for autosomal regions and ≤0.6 for sex chromosomes across all samples from the same patient; these cut-offs were determined based on the observed distribution of VAFs here and a previous report [20].
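The median-VAF filter can be sketched as below. One point is an assumption rather than something stated in the text: the median is taken over samples with at least one mutant read, since including samples that lack the mutation would pull the median towards zero. The function names and the `sex_chrom` flag are hypothetical; the cut-offs (0.3 and 0.6) are from the text.

```python
from statistics import median
from typing import Sequence

AUTOSOME_CUTOFF = 0.3   # mutations with median VAF <= 0.3 are discarded (from the text)
SEX_CHROM_CUTOFF = 0.6  # mutations with median VAF <= 0.6 are discarded (from the text)


def median_vaf(mutant_counts: Sequence[int], depths: Sequence[int]) -> float:
    """Median VAF over samples carrying the mutation (assumed definition)."""
    vafs = [m / d for m, d in zip(mutant_counts, depths) if m > 0 and d > 0]
    return median(vafs) if vafs else 0.0


def keep_mutation(mutant_counts: Sequence[int], depths: Sequence[int],
                  sex_chrom: bool = False) -> bool:
    """Return True if the mutation survives the median-VAF filter."""
    cutoff = SEX_CHROM_CUTOFF if sex_chrom else AUTOSOME_CUTOFF
    return median_vaf(mutant_counts, depths) > cutoff
```

A clonal heterozygous mutation in a single cell-derived colony is expected at a VAF of about 0.5, so it passes the autosomal cut-off, whereas a mutation acquired during in vitro expansion sits at a lower VAF and is discarded.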

We quantified sensitivity by measuring how well our algorithms called heterozygous germline polymorphisms in the colonies as a function of sequencing depth. Since our colonies are single cell-derived, we would expect heterozygous germline SNPs to have the same variant allele fraction distribution as true somatic mutations in that original single cell. We found that a sequencing depth of 8x leads to an estimated sensitivity of 70-75%, rising to >95% at a sequencing depth of 15x. The majority of colonies we sequenced had depths of >15x, and we set a minimum cut-off of 8x depth for inclusion of a colony within the study (Extended Figure 2E). Finally, we visually inspected allelic counts of removed germline variants with ≥2 samples without any mutant reads, and rescued embryonic mutations. Somatic variants were annotated using ANNOVAR [42].
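As a rough illustration of the sensitivity estimate and the depth cut-off, the sketch below computes, for each colony, the fraction of known heterozygous germline SNPs that were recovered, and keeps colonies at or above 8x depth. The record layout (`depth`, `het_called`, `het_total`) and function names are hypothetical; only the 8x cut-off comes from the text.

```python
from typing import Dict, List

MIN_DEPTH = 8  # minimum sequencing depth for a colony to be included (from the text)


def colony_sensitivity(het_called: int, het_total: int) -> float:
    """Fraction of heterozygous germline SNPs recovered in one colony."""
    if het_total <= 0:
        raise ValueError("need at least one heterozygous germline SNP")
    return het_called / het_total


def included_colonies(colonies: List[Dict]) -> List[Dict]:
    """Keep colonies meeting the depth cut-off and attach their sensitivity."""
    kept = []
    for colony in colonies:
        if colony["depth"] >= MIN_DEPTH:
            sensitivity = colony_sensitivity(colony["het_called"], colony["het_total"])
            kept.append({**colony, "sensitivity": sensitivity})
    return kept
```

Plotting the resulting sensitivity against depth across colonies would give a curve of the kind summarised above.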
