Evaluating MonSTR on simulated WGS data

Ileena Mitra
Bonnie Huang
Nima Mousavi
Nichole Ma
Michael Lamkin
Richard Yanicky
Sharona Shleizer-Burko
Kirk E. Lohmueller
Melissa Gymrek

We created 78 simulated quad families with 100 TR loci randomly selected from TRs passing all filters described above in the SSC cohort. Each simulated quad family consists of a father, a mother, a child with a known mutation (proband), and a child with no mutation (control). We tested the ability of our entire pipeline to genotype TRs with GangSTR and to call de novo mutations with MonSTR. To test the effect of depth of coverage, we generated datasets with 1x to 50x mean coverage and a mutation of +1 or −1 repeat units in the proband. To test the effect of TR mutation size, we generated WGS data at 40x coverage with mutations in probands ranging from −10 to +30 repeat units. Contraction mutations that would have resulted in negative repeat copy numbers were excluded. For both tests, we simulated data under three scenarios: (1) both parents with homozygous reference TR genotypes, (2) one parent heterozygous, and (3) both parents heterozygous (Extended Data Fig. 1).
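The mutation-size sweep and the exclusion rule for contractions can be illustrated with a minimal Python sketch. This is not the authors' simulation code: the function names, the flank/motif representation of a locus, and the example copy numbers are all hypothetical and chosen only to make the rule concrete.

```python
from typing import Dict, Iterable, Optional


def mutated_copy_number(ref_copies: int, delta: int) -> Optional[int]:
    """Return the proband copy number after a change of `delta` repeat units.

    Returns None when the contraction would give a negative copy number,
    mirroring the exclusion rule described in the text.
    """
    new_copies = ref_copies + delta
    if new_copies < 0:
        return None
    return new_copies


def build_locus_sequence(left_flank: str, motif: str, copies: int, right_flank: str) -> str:
    """Assemble a locus sequence: flank + repeated motif + flank (hypothetical layout)."""
    return left_flank + motif * copies + right_flank


def simulate_proband_alleles(ref_copies: int, deltas: Iterable[int]) -> Dict[int, int]:
    """Map each usable mutation size to the resulting proband copy number."""
    alleles = {}
    for delta in deltas:
        copies = mutated_copy_number(ref_copies, delta)
        if copies is not None:  # skip contractions that would go negative
            alleles[delta] = copies
    return alleles


if __name__ == "__main__":
    # Hypothetical locus with 5 reference copies of the motif "CAG".
    sizes = [d for d in range(-10, 31) if d != 0]
    alleles = simulate_proband_alleles(5, sizes)
    print(sorted(alleles)[0])  # smallest mutation size that was kept
    print(build_locus_sequence("AC", "CAG", alleles[1], "GT"))
```

For a locus with 5 reference copies, contractions larger than 5 units are dropped, so the sweep effectively covers −5 to +30 at that locus.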

WGS data were simulated using ART_illumina (ref. 34) v2.5.8 with non-default parameters -ss HS25 (HiSeq 2500 simulation profile), -l 150 (150 bp reads), -p (paired-end reads), -f coverage (fold coverage, set as described above), -m 500 (mean fragment size) and -s 100 (standard deviation of fragment size). ART_illumina was applied to FASTA files generated from 10 kb windows surrounding each TR locus, with mutations applied as described above. The resulting FASTQ files were aligned to the hg38 reference genome using bwa mem (ref. 35) v0.7.12-r1039 with non-default parameter -R "@RG\tID:sample_id\tSM:sample_id", which sets the read group ID and sample name to sample_id for each simulated sample. TRs were genotyped from aligned reads jointly across all members of the same family with GangSTR, using settings identical to those applied to SSC data.
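As a rough illustration, the two command lines described above could be assembled as follows. This is a sketch under stated assumptions: the file names and the helper function names are hypothetical, and the -i (input FASTA) and -o (output prefix) options of art_illumina are not listed in the text but are needed to run the tool. The remaining flags are the ones quoted above.

```python
import shlex
from typing import List


def art_command(fasta: str, out_prefix: str, coverage: float) -> List[str]:
    """Build an art_illumina call using the non-default parameters from the text."""
    return [
        "art_illumina",
        "-ss", "HS25",        # HiSeq 2500 simulation profile
        "-i", fasta,          # input FASTA for the locus window (assumed file name)
        "-l", "150",          # 150 bp reads
        "-p",                 # paired-end reads
        "-f", str(coverage),  # fold coverage
        "-m", "500",          # mean fragment size
        "-s", "100",          # standard deviation of fragment size
        "-o", out_prefix,     # output prefix (assumed)
    ]


def bwa_mem_command(reference: str, fq1: str, fq2: str, sample_id: str) -> List[str]:
    """Build a bwa mem call that sets the read group ID and sample name."""
    # bwa expects a literal backslash-t in the -R string and converts it to a tab.
    read_group = f"@RG\\tID:{sample_id}\\tSM:{sample_id}"
    return ["bwa", "mem", "-R", read_group, reference, fq1, fq2]


if __name__ == "__main__":
    # Hypothetical file names; paired-end output names depend on the chosen prefix.
    art = art_command("locus_proband.fa", "proband_", 40)
    bwa = bwa_mem_command("hg38.fa", "proband_1.fq", "proband_2.fq", "proband")
    print(shlex.join(art))
    print(shlex.join(bwa))
```

Building the commands as argument lists keeps the read group string intact when the list is passed to `subprocess.run`, without relying on shell quoting.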

We tested three mutation calling settings: a naïve mutation calling method based on hard genotype calls, MonSTR with default parameters, and MonSTR with the same set of filters applied to SSC data. Overall, all methods performed similarly well above 30x coverage. At lower coverage, MonSTR’s model-based method achieved reduced sensitivity but greater specificity compared with the naïve mutation calling pipeline (Extended Data Fig. 1).
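The text does not spell out the naïve method. One plausible reading, assumed here, is a Mendelian-consistency check on the hard genotype calls: a child genotype is flagged as a candidate de novo mutation when its two alleles cannot be explained by one allele from each parent. The sketch below implements that assumption with hypothetical function names; it is not MonSTR's model-based procedure.

```python
from typing import Tuple

Genotype = Tuple[int, int]  # repeat copy numbers of the two alleles


def is_mendelian_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if one child allele can come from the father and the other from the mother."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def naive_denovo_call(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """Flag a candidate de novo mutation from hard genotype calls alone (assumed rule)."""
    return not is_mendelian_consistent(child, father, mother)


if __name__ == "__main__":
    # Scenario 1 from the text: both parents homozygous reference (10 copies, hypothetical).
    father, mother = (10, 10), (10, 10)
    print(naive_denovo_call((10, 11), father, mother))  # proband with a +1 mutation
    print(naive_denovo_call((10, 10), father, mother))  # control child
```

Because this rule trusts every genotype call, a single genotyping error at low coverage produces a false mutation call, which is consistent with the lower specificity reported for the naïve approach.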
