Assessing accuracy on truth data sets

Kristen M. Laricchia
Nicole J. Lake
Nicholas A. Watts
Megan Shand
Andrea Haessly
Laura Gauthier
David Benjamin
Eric Banks
Jose Soto
Kiran Garimella
James Emery
Heidi L. Rehm
Daniel G. MacArthur
Grace Tiao
Monkol Lek
Vamsi K. Mootha
Sarah E. Calvo

Sample NA12878 and 22 samples from diverse L haplogroups were selected for in silico mixing experiments to create a large truth data set compared to the reference Chr M (totaling 1200 variants at 286 positions, including eight indels). For each L haplogroup sample, the number of mtDNA reads per sample was counted (SAMtools v1.8 idxstats [Li et al. 2009]), and then downsampling was performed (SAMtools v1.8) to create five BAM files containing a predefined ratio of reads from the L haplogroup sample and NA12878 (1%, 5%, 50%, 90%, 99%). For each mixture, total coverage was set to the L haplogroup sample's original coverage. GATK's HaplotypeCaller version 4.0.3.0 was used to call homoplasmic variants on the original BAMs before downsampling, with the ploidy argument set to 100. For each L haplogroup sample, a truth set was defined as variants present in the L haplogroup sample (allele count > 94/100) but absent in NA12878 (based on manual review using overlapping read pair data, with padding of 1 bp around each NA12878 variant). For each L haplogroup sample mixture, true- and false-positive calls were calculated against the sample-specific truth set and then summed across all 22 L haplogroup samples to create sensitivity and precision metrics.
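The arithmetic behind the mixing and scoring steps can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that coverage is approximated by mtDNA read counts, that each mixture is built by subsampling both BAMs to fractions of their reads (for example with a fraction-based option such as `samtools view -s`), and that false negatives are the truth-set variants missing from the calls. The names `subsample_fractions`, `pooled_metrics`, and `Variant` are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Hypothetical variant representation: (position, ref allele, alt allele).
Variant = Tuple[int, str, str]

# Fraction of reads drawn from the L haplogroup sample in each mixture.
MIX_RATIOS = (0.01, 0.05, 0.50, 0.90, 0.99)


def subsample_fractions(
    l_reads: int, na12878_reads: int, l_ratio: float
) -> Tuple[float, float]:
    """Return the fraction of reads to keep from each BAM for one mixture.

    Assumption: the mixture's total read count equals the L haplogroup
    sample's original mtDNA read count (l_reads).
    """
    if not 0.0 < l_ratio < 1.0:
        raise ValueError("l_ratio must be strictly between 0 and 1")
    if l_reads <= 0 or na12878_reads <= 0:
        raise ValueError("read counts must be positive")
    # Keeping l_ratio of the L sample's reads yields l_ratio * l_reads reads.
    l_fraction = l_ratio
    # The remaining (1 - l_ratio) * l_reads reads come from NA12878.
    na_fraction = (1.0 - l_ratio) * l_reads / na12878_reads
    if na_fraction > 1.0:
        raise ValueError("NA12878 has too few reads for this mixture")
    return l_fraction, na_fraction


def pooled_metrics(
    per_sample: Iterable[Tuple[Set[Variant], Set[Variant]]]
) -> Tuple[float, float]:
    """Sum counts over (truth_set, called_set) pairs, then compute metrics.

    Returns (sensitivity, precision) from the pooled counts, so samples
    with larger truth sets contribute more than samples with smaller ones.
    """
    tp = fp = fn = 0
    for truth, calls in per_sample:
        tp += len(truth & calls)   # called and in the truth set
        fp += len(calls - truth)   # called but not in the truth set
        fn += len(truth - calls)   # in the truth set but not called
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, precision
```

Under these assumptions, `subsample_fractions` would be called once per ratio in `MIX_RATIOS` for each L haplogroup sample, and `pooled_metrics` once per ratio over all 22 samples. Pooling the counts before dividing gives a different result from averaging per-sample sensitivity and precision.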
