Evaluating MonSTR on simulated WGS data

Ileena Mitra; Bonnie Huang; Nima Mousavi; Nichole Ma; Michael Lamkin; Richard Yanicky; Sharona Shleizer-Burko; Kirk E. Lohmueller; Melissa Gymrek

Improve Research Reproducibility A Bio-protocol resource

This method is extracted from research article: Jan 2021

Patterns of de novo tandem repeat mutations and their role in autism

DOI: 10.1038/s41586-020-03078-7

We created 78 simulated quad families with 100 TR loci randomly selected from the TRs passing all filters described above in the SSC cohort. Each simulated quad family consists of a father, a mother, a child with a known mutation (proband), and a child with no mutation (control). We used these data to test the ability of our entire pipeline to genotype TRs with GangSTR and call de novo mutations with MonSTR. To test the effect of depth of coverage, we generated datasets with 1x to 50x mean coverage and a mutation of +1 or −1 repeat units in the proband. To test the effect of TR mutation size, we generated WGS data with 40x coverage and mutations in probands ranging from −10 to +30 repeat units. Contraction mutations that would have resulted in negative repeat copy numbers were excluded. For both tests, we simulated data under three scenarios: (1) both parents homozygous for the reference TR genotype, (2) one parent heterozygous, and (3) both parents heterozygous (Extended Data Fig. 1).
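The mutation-size design above can be illustrated with a minimal Python sketch. It is not the authors' simulation code: the function names, the flank/motif representation of a locus, and the choice to skip a change of 0 are assumptions made for illustration. It shows how a proband allele could be built from a reference repeat tract, and how a contraction that would give a negative copy number is excluded.

```python
from typing import Optional

# Mutation sizes for the 40x test: -10 to +30 repeat units.
# Skipping 0 (no mutation) is an assumption of this sketch.
MUTATION_SIZES = [d for d in range(-10, 31) if d != 0]


def mutated_copy_number(ref_copies: int, delta: int) -> Optional[int]:
    """Return the proband copy number, or None if it would be negative."""
    new_copies = ref_copies + delta
    if new_copies < 0:
        return None  # contraction excluded, as described in the text
    return new_copies


def apply_tr_mutation(left_flank: str, motif: str, ref_copies: int,
                      right_flank: str, delta: int) -> Optional[str]:
    """Build the mutated window sequence for one TR locus (hypothetical helper)."""
    new_copies = mutated_copy_number(ref_copies, delta)
    if new_copies is None:
        return None
    return left_flank + motif * new_copies + right_flank


if __name__ == "__main__":
    # A toy locus: 5 copies of CAG between two short flanks.
    for delta in (+1, -1, -10):
        print(delta, apply_tr_mutation("ACGT", "CAG", 5, "TTGA", delta))
```

In this toy example the −10 contraction returns None because a locus with 5 reference copies cannot lose 10 repeat units.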

WGS data were simulated using ART_illumina³⁴ v2.5.8 with non-default parameters -ss HS25 (HiSeq 2500 simulation profile), -l 150 (150 bp reads), -p (paired-end reads), -f coverage (coverage was set as described above), -m 500 (mean fragment size) and -s 100 (standard deviation of fragment size). ART_illumina was applied to FASTA files generated from 10 kb windows surrounding each TR locus, with any mutations applied as described above. The resulting FASTQ files were aligned to the hg38 reference genome using bwa mem³⁵ v0.7.12-r1039 with non-default parameter -R "@RG\tID:sample_id\tSM:sample_id", which sets the read group ID and sample name to sample_id for each simulated sample. TRs were genotyped from the aligned reads jointly across all members of the same family with GangSTR, using settings identical to those applied to the SSC data.
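As a sketch of how the two command lines described above might be assembled, the Python helpers below build argument lists for art_illumina and bwa mem. The helper names, file names, and sample ID are hypothetical; the -i (input FASTA) and -o (output prefix) options of art_illumina are not listed in the text and are included here as an assumption about how the tool was invoked. The commands are only printed, not executed.

```python
import shlex
from typing import List


def art_command(window_fasta: str, out_prefix: str, coverage: float) -> List[str]:
    """Assemble an art_illumina call with the non-default parameters from the text."""
    return [
        "art_illumina",
        "-ss", "HS25",          # HiSeq 2500 simulation profile
        "-i", window_fasta,     # input FASTA (assumed, not stated in the text)
        "-l", "150",            # 150 bp reads
        "-p",                   # paired-end reads
        "-f", str(coverage),    # fold coverage
        "-m", "500",            # mean fragment size
        "-s", "100",            # standard deviation of fragment size
        "-o", out_prefix,       # output prefix (assumed, not stated in the text)
    ]


def bwa_mem_command(reference: str, fastq1: str, fastq2: str,
                    sample_id: str) -> List[str]:
    """Assemble a bwa mem call that sets the read group ID and sample name."""
    # bwa expects the two-character sequence backslash + t, not a real tab.
    read_group = f"@RG\\tID:{sample_id}\\tSM:{sample_id}"
    return ["bwa", "mem", "-R", read_group, reference, fastq1, fastq2]


if __name__ == "__main__":
    # Hypothetical file names for one simulated proband at 40x coverage.
    print(shlex.join(art_command("proband_windows.fa", "proband_", 40)))
    print(shlex.join(bwa_mem_command("hg38.fa", "proband_1.fq",
                                     "proband_2.fq", "proband")))
```

The FASTQ names passed to bwa mem assume that art_illumina writes the two mates to files ending in 1.fq and 2.fq under the given prefix; adjust them if your output naming differs.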

We tested three mutation calling settings: a naïve mutation calling method based on hard genotype calls, MonSTR using default parameters, and MonSTR using the same set of filters as applied to the SSC data. Overall, we found that all methods perform similarly well above 30x coverage. At lower coverage, MonSTR's model-based method achieves lower sensitivity but greater specificity than the naïve mutation calling pipeline (Extended Data Fig. 1).
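The text does not spell out the naïve method, so the following Python sketch is only one plausible reading of "hard genotype calls": a child is flagged as carrying a de novo mutation whenever its called genotype cannot be explained by inheriting one allele from each parent. Genotypes are represented as pairs of repeat copy numbers, and the function names are hypothetical.

```python
from typing import Tuple

Genotype = Tuple[int, int]  # hard-called repeat copy numbers of the two alleles


def is_mendelian_consistent(child: Genotype, father: Genotype,
                            mother: Genotype) -> bool:
    """True if one child allele can come from the father and the other from the mother."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def naive_denovo_call(child: Genotype, father: Genotype,
                      mother: Genotype) -> bool:
    """Flag a de novo mutation when the hard calls violate Mendelian inheritance."""
    return not is_mendelian_consistent(child, father, mother)


if __name__ == "__main__":
    father, mother = (10, 10), (10, 10)   # both parents homozygous reference
    print(naive_denovo_call((10, 11), father, mother))  # proband with +1 unit
    print(naive_denovo_call((10, 10), father, mother))  # control child
```

Under this reading, a single miscalled allele in any family member is enough to produce a call, whereas a model-based caller can weigh the uncertainty in each genotype.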

Reprints and permissions information is available at www.nature.com/reprints. Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research, subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms
