Validation with targeted PCR amplicon-based sequencing approaches

YW Yifan Wang
TB Taejeong Bae
JT Jeremy Thorpe
MS Maxwell A. Sherman
AJ Attila G. Jones
SC Sean Cho
KD Kenneth Daily
YD Yanmei Dou
JG Javier Ganz
AG Alon Galor
IL Irene Lobon
RP Reenal Pattni
CR Chaggai Rosenbluh
ST Simone Tomasi
LT Livia Tomasini
XY Xiaoxu Yang
BZ Bo Zhou
SA Schahram Akbarian
LB Laurel L. Ball
SB Sara Bizzotto
SE Sarah B. Emery
RD Ryan Doan
LF Liana Fasching
YJ Yeongjun Jang
DJ David Juan
EL Esther Lizano
LL Lovelace J. Luquette
JM John B. Moldovan
RN Rujuta Narurkar
MO Matthew T. Oetjens
RR Rachel E. Rodin
SS Shobana Sekar
JS Joo Heon Shin
ES Eduardo Soriano
RS Richard E. Straub
WZ Weichen Zhou
AC Andrew Chess
JG Joseph G. Gleeson
TM Tomas Marquès-Bonet
PP Peter J. Park
MP Mette A. Peters
JP Jonathan Pevsner
CW Christopher A. Walsh
DW Daniel R. Weinberger
FV Flora M. Vaccarino
JM John V. Moran
AU Alexander E. Urban
JK Jeffrey M. Kidd
RM Ryan E. Mills
AA Alexej Abyzov
request Request a Protocol
ask Ask a question
Favorite

Different PCR amplicon-based sequencing approaches were used to validate mosaic SNV candidates identified using the different computational pipelines. Each of the validation procedures is described below.

We validated the called SNVs by PCR amplicon-seq using following samples: (i) reference tissue—pulverized cortex DNA; (ii) reference tissue—fibroblast DNA; and (iii) control lymphoblastoid cell line DNA—NA12878. We determined the genomic location of each SNV using the Genome Browser. We designed each primer pair by selecting an 800-nucleotide-long DNA template surrounding the SNV (400 nucleotides up- and 399 nucleotides down-stream of each SNV) using Primer3 [43]. The specificity of primers was confirmed by UCSC in silico PCR (http://genome.ucsc.edu/cgi-bin/hgPcr) (Additional file 3:Table S5).

The PCR amplification was performed using the Phusion High-Fidelity DNA polymerase (Thermo Fisher). The optimal annealing temperature was predicted using the Tm calculator tool provided by the Thermo Fisher website. The presence and size of each amplicon was confirmed by gel electrophoresis (2% agarose gel). In samples where we detected multiple PCR products, DNA bands with the correct amplicon sizes were extracted via gel extraction and purification using 2% agarose precast gels (E-Gel EX, Invitrogen) and gel imager (E-Gel Safe Imager connected to E-Gel iBase, Invitrogen), according to the recommendations provided by the manufacturer. Amplified DNA fragments were purified using the QIAquick PCR Purification Kit (Qiagen). Samples were multiplexed and sequenced under following conditions: MiSeq paired-end, 300 bp, 3 samples/lane.

Putative mosaic SNVs were validated by high-throughput sequencing of amplicons that contain the SNV and then calculating the relative abundance of reads containing the alternate allele. Primers were designed using Primer 3 software [43] with 300–400 bp of genomic sequencing surrounding the SNV as the input. Since the read length of the amplification product is ~ 300 bp, we were able to gain an overlapped region between the paired reads. With the overlapped regions containing the somatic SNV candidates, we were able to sequence each candidate site twice and increase the accuracy of sequencing result by excluding reads with non-concordant bases from the pair ended reads at the candidate SNV positions. If possible, SNPs known to be heterozygous in our samples were included in the amplified sequence used as input. Primers were tested in silico [44] to confirm they uniquely target the correct region of the genome. Phusion High-Fidelity DNA Polymerase (New England Biolabs) was used according to instructions provided by the manufacturer for amplification, and primers were cycled under varying conditions to determine optimal PCR mix and annealing temperature. To generate amplicons for sequencing, either NA12878 or gDNA from the 5154 common reference brain tissue was used as a template. The PCR product was purified with 0.7× SPRIselect beads (Beckman Coulter) and 10% of product was visualized on an agarose gel to confirm that only one amplicon of the correct size was present. If the size of the amplicon agreed with the size predicted from the primer design, we then sequenced the targeted amplified genomic fragment using an Illumina MiSeq sequencer. If the size was incorrect, we designed a second set of primers to obtain unique amplification. If none of the primers worked, the candidate was flagged with “primer not designed.” Protocols and reagents from NEBNext Ultra™ DNA Library Prep Kit for Illumina (New England Biolabs) were used for end repair and dA-tailing and to ligate NextFlex adapters (Perkin Elmer, Waltham, MA) onto amplicons. After ligation, reactions were purified with 0.7× SPRIselect beads (Beckman Coulter) and PCR enrichment of adapter-ligated DNA was performed for 10 cycles using NEBNext Ultra™ DNA Library Prep Kit (New England Biolabs). Amplified libraries were purified with 0.7× SPRIselect beads and sequenced with MiSeq Reagent Kit v3, 600 cycle PE on MiSeq sequencer (Illumina, San Diego, CA).

We sent all 400 target sites to Paragon Genomics and obtained a CleanPlex Custom NGS Panel allowing us to perform two multiplex PCR-based targeted resequencing assays for a total of 382 target sites. Eighteen of 400 sites were deemed not reliably targetable. The panels were designed using a proprietary primer design algorithm (sequences are available upon request) and proprietary background cleaning and molecular barcoding technologies to amplify the regions of interest and attach adapters for Illumina sequencing. The panel was iteratively optimized in silico prior to amplifying DNA from the common sample. The amplicons ranged in size from 250 to 425 bp (including the adapter sequences). A single MiSeq run provided ample coverage for all successfully targeted sites.

Targeted validation was attempted on 100 candidate mosaic variants and primers for 98 of those were successfully generated using BatchPrimer3 [45]. PCR primers were synthesized with flanking Ion Torrent adapters P and A. Each primer pair was unique and their sequence served as a barcode for variant identification. Phusion HotStart II DNA Polymerase (Thermo) CR was used for DNA amplification over 25 cycles following the manufacturer guidelines. PCR products were pooled and purified using AMPure XP technology (Agencourt) to be sequenced on the Ion Torrent S5 platform (Ion 530 chip). After demultiplexing and trimming, reads were mapped using BWA-MEM and locally realigned with GATK. Average coverage of mappable reads was 96,350× per variant. Mosaic fractions were identified using custom scripts based on SAMtools mpileup.

We validated candidate mosaic SNVs by targeted, high-coverage amplicon sequencing. PCR primers were designed with Primer3 [43] to generate amplicons from 300 to 500 bp and in silico verified to minimize the possibility off-target products. Amplicon sizes and purity were confirmed by PCR followed by agarose gel electrophoresis. Amplicons were quantitated, purified, and pooled for each of the NA12878 and 5154 brain tissue samples. NGS libraries were prepared using Illumina TruSeq Nano DNA kit and following standard Illumina NGS library preparation protocols. Libraries were sequenced with MiSeq Reagent Kit v3, 600 cycle PE on a MiSeq. Following demultiplexing and trimming, paired-end reads were merged with FLASH [46] and aligned with BWA-MEM to hs37d5 human reference genome. The average depth of sequencing was 103,335× per candidate. Target mosaicism was assessed by SAMtools mpileup and custom scripts.

We constructed a framework for uniform processing of PCR amplicon validation data. The PCR amplicon paired-end sequencing data was assembled to a single read using PEAR [47] with the non-concordant bases between the two reads set to N with a base quality of zero. We then aligned the reads to GRCh37d5 using BWA-MEM. Indel realignment was performed using the Genome Analysis Toolkit [28]. After pre-processing, we applied a series of filters to evaluate the putative mosaic SNV calls. We required a minimum of 200 reads covering the candidate mosaic SNV prior to making a “call” decision. Sites with fewer than 200 reads covering the candidate SNV were assigned the flag of “read not enough.” We then compared the allele fractions of the candidate mosaic SNV in the amplicon data from common reference brain sample and from NA12878. Since the same mosaic SNV event is unlikely to occur in two different individuals, we reasoned that bona fide mosaic SNVs should only be present in the brain sample. We applied a hard cutoff that removed any site with a candidate allele fraction > 0.01 in NA12878 and additionally applied a Skellam test to compare the putative mosaic SNV allele frequencies in the two samples. Together, these criteria excluded likely false-positive mosaic SNV calls that arise from sequencing errors in certain genomic contexts. We additionally established an empirical error model to exclude false-positive mosaic SNVs that likely arose during DNA amplification, library preparation, and/or sequence processing. We further evaluated PCR amplicon sequencing errors by assessing the mismatch rate (second allele frequency) in the overlapping region of the paired-end reads. We then used the 95th percentile of the mismatch rate distribution found for each type of base change as the sequencing error cutoff for base changes at the candidate positions. Finally, we excluded candidate mosaic SNVs with > 0.40 VAFs, as they likely represent germline SNPs.

We performed some additional final checks on the validated mosaic SNV events. First, we tested whether there was a positive correlation between the putative mosaic SNV VAFs in the WGS and PCR amplicon-based sequencing experiments (Fig. S13). Second, we tested whether there was a positive correlation between the ratios of the second highest allele frequency and the third highest allele frequency with the putative somatic SNV VAF in the WGS dataset (Fig. S13); the third highest allele frequency did not correlate with the variant allele frequency (Fig. S13).

To distinguish a true mosaic variant from a multiple displacement amplification artifact in single-cell sequencing, we adapted the method of Dong et al. [24]. First, we augmented their kernel smoothing method to utilize phased genotypes [48] as this improves estimation of allelic imbalance at a given locus. Phasing was performed on the GATK HaplotypeCaller germline variants from replicate 6 bulk brain tissue using Eagle [49] and the 1000 Genomes Phase 3 reference panel [50]. Second, we determined the optimal kernel size to maximize accuracy using PaSD-qc [51], which was determined to be 10 kb. Third, we instituted a likelihood ratio test to distinguish a first-round amplification artifact from a true mosaic variant; let f(X; p) be the probability of observing reads X where p is the expected proportion of alternate reads. Under the null hypothesis that the reads are generated due to amplification artifact, p = 0.125. Under the alternate hypothesis that the reads are generated due to a true mosaic variant, p = 0.125 + θ where p is the allelic balance estimated by kernel smoothing [24]. Since the models of the two hypotheses are nested, 2×logfX0.125+θ/fX0.125X12. We applied this model at all 400 sites at which we attempted amplicon validation across the 12 deeply sequenced single cells to obtain high-quality single-cell genotypes. Scripts for running this model are available at https://github.com/parklab/SCGenotyper_mini.

We used haplotype information provided by 10X Genomics linked read sequencing data to eliminate false-positive mosaic SNV calls and to provide support for bona fide mosaic SNVs. Briefly, the 10X Genomics linked read approach is a barcoded short-read sequencing technology, where individual HMW genomic DNA molecules (~ 50 kb) initially are attached to a bead then incorporated into droplets. Within each droplet, the HMW DNA fragments are fragmented into small pieces, assigned a unique barcode, and then undergo amplification. The resultant DNA libraries from each individual droplet then were subject to standard Illumina short-read sequencing. Short reads containing the same barcodes were aligned and re-assembled into larger linked reads using the LongRanger pipeline (https://support.10xgenomics.com/genome-exome/software/pipelines/latest/what-is-long-ranger). Overlapping individual HMW fragments that shared informative germline SNP alleles used reconstruct longer, continuous haplotype segments. This results in the assignment of sequencing reads derived from the initial HMW fragments to individual haplotypes. Using this information, we identified and removed candidate mosaic SNVs that were found on both parental haplotypes, which likely arose from systematic sequencing errors. Moreover, the haplotype information could be used to identify candidate mosaic SNVs that were represented > 90% of the reads on a single parental haplotype and are likely germline SNPs. By comparison, bona fide somatic SNVs are detected as a minor allele on a single parental haplotype. Tools for analyzing linked read BAMs are available at https://github.com/KiddLab/hapfilter-10X.

We performed ddPCR using the primers and probe listed in Additional file 4: Table S6. We initially designed 29 ddPCR assays to assess candidate mosaic SNVs present at < 3% VAFs and/or near know SVs (Additional file 1: Table S3). This initial set included (i) 4 mosaic SNVs at higher VAFs (12–28% in amplicon sequencing); (ii) 5 mosaic variants at lower VAFs (1–2.5% in amplicon sequencing); (iii) 3 SNVs deemed to be germline SNVs; and (iv) 17 false-positive calls (including 6 that were present at < 0.5% VAFs). The ddPCR reaction mixture (20 μL) contained 2 μL of bulk gDNA or NA12878 (25 ng) template, 250 nM of each primer (reference and assay specific pairs), and 250 nM TaqMan probe (Ref-HEX and assay specific-FAM dye labeled) in a 1× Bio-Rad Supermix (Bio-Rad, Hercules, CA, USA). The 20 μl PCR reaction was mixed with Bio-Rad droplet generator oil and partitioned into 15,000–20,000 droplets using the Bio-Rad QX-100 droplet generator (Bio-Rad). The generated droplets were transferred to a 96-well PCR reaction plate and sealed with a pierceable sealing foil. PCR conditions were 10 min at 95 °C, 40 cycles of denaturation for 30 s at 94 °C, assay specific anneal/extension temperatures 50–57.5 °C (as noted in table) for 60 s with ramp rate of 2.5 °C s− 1, followed by 10 min at 98 °C and a hold at 4 °C. Post-PCR amplification each droplet was checked for fluorescence to count the number of droplets that yielded positive/negative results, using the Bio-Rad QX-100 droplet reader (Bio-Rad) and frequencies calculated based on total droplet counts per well. In the end, we were able to design informative ddPCR validation tests for 13 candidate mosaic SNVs. For the remaining 16 variants, we could not design effective probes because the variants were within AT-rich or repetitive regions or the assay failed to separate clusters at various annealing temperatures.

We tested 383 sites using PCR amplicon-based sequencing (Fig. 3). Of these, 280 were determined to be false-positive calls, 64 had read support above background levels (eventually called as 9 germline SNVs, 17 false-positive calls, and 38 true mosaic SNVs), and 39 had a discordant validation status between two different BSMN groups (eventually called as 2 germline SNVs, 32 false-positive calls, and 5 true mosaic SNVs).

To further evaluate the discordant calls, we considered experiments yielding overlapping end sequence reads as more reliable because they resulted in higher-quality sequencing data (i.e., higher confidence base calls, reduced indels, and better mappability); these analyses allowed us to resolve 11/39 candidates. Other information that allowed the interrogation of the remaining calls included (1) determining the number of haplotypes observed in 10X genomics linked read sequencing datasets (i.e., “3 haplotypes” were observed for true-positive calls and “4 haplotypes” were observed for false-positive calls); (2) eliminating SNVs at a < 0.005 VAF in the PCR amplicon-based sequencing experiments (as such variants are unlikely be called from WGS data); (3) eliminating calls with fewer than 3 reads per WGS replicate; (4) eliminating variants within homopolymer tracts, CNVs, within 3 bps of indels, and near a germline Alu insertion (Fig. S8). We also applied the above criteria to invalidate 17/64 candidates with read support above background levels. Furthermore, 9/64 and 2/39 discordant candidates were germline SNVs, as their VAF approached 0.5 and only “2 haplotypes” were observed for the candidate calls in 10X linked read sequencing data. Following ddPCR, one SNV candidate initially assigned as a false positive was reassigned as a true SNV. Finally, based on the examination of 10X linked read sequencing datasets, one of the 17 candidates that lacked enough data from the PCR amplicon-based sequencing experiments (“NED”; not enough data) was determined to be a germline SNV. By consolidating all validation results, we arrived at a confident set of 43 mosaic SNVs.

We applied a PON mask for all the candidate somatic SNV sites to exclude the possible sequencing artifacts using the 1000 Genomes Project whole-genome sequencing libraries. We collected the counts for reference alleles and alternative alleles for each candidate site in each of the 2504 samples with a read quality cutoff at 20 and a base quality cutoff at 20. Candidate sites that have greater than 0.05 candidate allele frequency in more than 5 samples were identified as false positives. Thus, the PON mask filter out possible recurrent artifacts that are induced during sequencing or library preparation.

We consolidated the above filtering strategies into a naïve Bayes classifier and assessed our ability to characterize the 400 putative candidate mosaic SNVs using a leave-one-out cross-validation method. Briefly, we iteratively trained the model on validation results from 399 mosaic SNVs and predicted the validation status of the held-out SNV. This classifier uses features such as the allele count of a candidate somatic SNV across the 2504 samples in the 1000 Genomes Project as well as Chromium 10X linked read and single-cell properties to aid in the discrimination of true somatic SNVs from false-positive calls (see below). Given that trade-offs between maximizing the number of true-positive calls and excluding false-negative calls are inevitable in statistical modeling, we ultimately arrived at conservative calling parameters, where the classifier achieved 90% precision, 42% sensitivity, and 99% specificity. While this model suffers with respect to sensitivity, the chosen values maximize the rejection of false-positive calls at the expense of removing some false-negative calls (Additional file 1: Table S4). For example, false-negative calls were mostly present at low VAFs (median = 0.013) and lacked haplotype support in Chromium 10X linked read (i.e., 15/25 false-negative calls lacked haplotype support, whereas only 4/18 true-positive calls lacked haplotype support) and/or single-cell sequencing datasets (i.e., 22/25 false-negative calls lacked support in single-cell datasets as compared to 9/18 true-positive calls). Overall, these results were consistent with our prior observations and indicated that integrating various sequence characteristics can distinguish between bona fide mosaic SNVs and false-positive signals. Indeed, these criteria, as well as additional filtering strategies, have been consolidated for use in a recently developed machine learning algorithm, MosaicForecast [35].

We trained a naïve Bayes classifier to provide the probability that a given site will pass or fail amplicon validation based on the properties of the site in the 75 WGS panel of normal samples, 10X sequencing from the common brain tissue, and single-cell whole-genome sequencing. Given a set of training sites, we calculated the conditional likelihood of a site having the following properties given its validation status (pass or fail):

Panel of normal properties:

○ Presence of alternate reads in ≤5 of the 75 panel of normal samples

○ Presence of alternate reads in >5 of the 75 panel of normal samples

10X properties:

○ PASS 10X filters (see above Methods)

○ FAIL 10X filters for any reason

○ No 10X information available due to lack of alternate reads

Single cell properties:

○ No alternate reads present in any single cells

○ Alternate reads present in <4 single cells

○ Alternate reads present in ≥4 single cells

○ Locus flagged as PASS by single cell genotyping method

○ Locus flagged as LOWQUAL by single cell genotyping method

○ Locus flagged as ARTIF by single cell genotyping method

○ Locus has no flag assigned by single cell genotyping algorithm

The conditional likelihood was calculated as the number of sites exhibiting that property and having a certain validation status divided by the number of sites with that validation status. To prevent events of probability zero, each category was given a prior pseudocount of 1. The prior probability of passing (failing) was calculated as the number of test sites in the training data which passed (failed) amplicon validation divided by the number of test sites in the training data.

Let B be a binary random variable which takes on the value 0 if the site fails validation and 1 if it passes validation. Let S be the set of properties given above for some new locus:

The sensitivity and specificity of this approach was calculated using leave-one-out cross-validation on the 400 sites.

The best practices workflow was implemented as a computational pipeline composed of a series of bash job scripts compatible with the job scheduler and Sun Grid Engine. The software includes the steps of uniform data processing, variant calling with GATK at various ploidies, and variant filtering. It is accessible at https://github.com/bsmn/bsmn-pipeline. Although the pipeline is executable at any cluster system using Sun Grid Engine as a job scheduler, we used AWS ParallelCluster (https://github.com/aws/aws-parallelcluster) as the running environment for the BSMN project. The configuration used for setting up the ParallelCluster is available at https://github.com/bintriz/bsmn-aws-setup.

To reconstruct a cell lineage tree from single-cell data, we examined whether the union set of 49 SNVs identified in our validation call set and called by the application of the best practices workflow to DLPFC replicate 6 WGS were shared among twelve NeuN+ neuron single-cell datasets. An SNV was considered “present” if we could find at least one supporting read in a single cell. The VAFs of the 49 SNVs also were estimated from the following WGS datasets: DLPFC (4 samples), NeuN+ (2 datasets), and NeuN− (1 dataset) DLPFC cell fractions, cerebellum (1 dataset), dura mater (1 dataset), and dural fibroblasts (2 datasets). We conducted hierarchical clustering of SNVs based upon their VAFs in different tissues using average linkage with Canberra distance. Clustering of single cells was performed by assessing shared genotypes using Jaccard distance. The final lineage tree was constructed by manual inspection of shared genotypes in each cell and the SNV VAF values across tissues. SNVs defining earlier developmental lineages have higher VAFs than SNVs defining later developmental sub-lineages. Six SNVs with highest VAFs clustered together and were present in two distinct groups of single cells (the L1 and L2 lineages). We found that 4/6 SNVs (SNVs 1–4) denote the L1 lineage and 2/6 SNVs (SNVs 10 and 11) denote the L2 lineage. This analysis was conducted using the SciPy package of python. Notably, one single-cell dataset lacked the 49 SNVs and was excluded from the analysis.

Two calls (SNV6 and SNV16) were not validated in our PCR amplicon-based sequencing experiments, but passed the 10X Genomics linked read haplotype test. Further analyses indicated that SNV6 and SNV16 exhibited consistent VAFs across multiple sequencing replicates, that no apparent alignment or sequencing artifacts surround the SNV site, and that these SNVs defined sub-branches of L1 and L2 lineages. Moreover, SNV16 appeared together with SNV17 in a single cell and had similar VAFs, suggesting they represent markers for a sub-branch of the L2 lineage and, consequently, SNV16 is a bona fide mosaic SNV.

To identify mosaic SNV pairs that putatively were marking different cell lineages in WGS bulk DNA, we calculated all possible pairwise VAF anti-correlation values for the 49 somatic mosaic SNVs identified in our validation set and best practices pipeline. For each pairwise comparison, we calculated the mean distance of VAFs of the SNV pair across multiple tissue samples according to the diagonal line: VAF1 + VAF2 = 0.5. We then used that distance as a score to rank SNV pairs by how well they fit the diagonal line (the lower score is better). The eight best scores corresponded to SNV pairs, where one represented the L1 lineage and the other one represented the L2 lineage. For seven pairs, the scores were significantly different (< 0.05) from the other tested pairs (Fig. S12).

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

post Post a Question
0 Q&A