Detection of SNPs, indels and SVs using whole-haplotype genome alignment

Chentao Yang; Yang Zhou; Stephanie Marcus; Giulio Formenti; Lucie A. Bergeron; Zhenzhen Song; Xupeng Bi; Juraj Bergman; Marjolaine Marie C. Rousselle; Chengran Zhou; Long Zhou; Yuan Deng; Miaoquan Fang; Duo Xie; Yuanzhen Zhu; Shangjin Tan; Jacquelyn Mountcastle; Bettina Haase; Jennifer Balacco; Jonathan Wood; William Chow; Arang Rhie; Martin Pippel; Margaret M. Fabiszak; Sergey Koren; Olivier Fedrigo; Winrich A. Freiwald; Kerstin Howe; Huanming Yang; Adam M. Phillippy; Mikkel Heide Schierup; Erich D. Jarvis; Guojie Zhang

This method is extracted from research article: Nature, Apr 2021

Evolutionary and biomedical insights from a marmoset diploid genome assembly

DOI: 10.1038/s41586-021-03535-x

To call heterozygous sites between the two haploid sequences, independent of the GenomeScope calculation, we first aligned them with MUMmer (v.3.23) using the parameters ‘nucmer -maxmatch -l 100 -c 500’. Because our assemblies span most repetitive sequences, repeat masking was not necessary before the MUMmer alignment. A series of custom scripts (https://github.com/comery/marmoset) identified and sorted SNPs and indels in the alignments. We used svmu (v.0.4-alpha)⁷¹, Assemblytics (v.1.2)⁷² and SyRI (v.1.0)⁷³ to detect SVs from the MUMmer alignment. After several test rounds, we found that svmu reported large indels more accurately, Assemblytics detected CNVs well, particularly tandem repeats, and SyRI detected other SVs well. We therefore ran all three methods and combined their results as the confident SV set. We used default parameters for svmu and Assemblytics, and the recommended nucmer alignment for SyRI (https://schneebergerlab.github.io/syri/).
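As a rough illustration of what extracting SNPs and indels from a pairwise alignment involves, the sketch below walks two gapped, equal-length aligned strings and reports substitutions and gap runs. It is not the authors' custom scripts (those live in the repository linked above); the function name `call_small_variants`, the tuple layout and the toy sequences are assumptions made for illustration only.

```python
from typing import List, Tuple

# (type, 0-based position on the ungapped first sequence, ref allele, alt allele)
Variant = Tuple[str, int, str, str]


def call_small_variants(ref_aln: str, alt_aln: str) -> List[Variant]:
    """Report SNPs and indels from two gapped aligned sequences.

    '-' marks a gap. An insertion is placed at the position of the
    reference base that follows it.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")

    variants: List[Variant] = []
    ref_pos = 0  # coordinate on the ungapped reference sequence
    i, n = 0, len(ref_aln)
    while i < n:
        r, a = ref_aln[i], alt_aln[i]
        if r == "-":
            # Run of reference gaps: bases present only in the other haplotype.
            j = i
            while j < n and ref_aln[j] == "-":
                j += 1
            inserted = alt_aln[i:j].replace("-", "")
            if inserted:
                variants.append(("INS", ref_pos, "", inserted))
            i = j
        elif a == "-":
            # Run of gaps in the other haplotype: bases present only in the reference.
            j = i
            while j < n and alt_aln[j] == "-" and ref_aln[j] != "-":
                j += 1
            deleted = ref_aln[i:j]
            variants.append(("DEL", ref_pos, deleted, ""))
            ref_pos += len(deleted)
            i = j
        else:
            if r.upper() != a.upper():
                variants.append(("SNP", ref_pos, r, a))
            ref_pos += 1
            i += 1
    return variants


# Toy example (hypothetical sequences, not marmoset data)
calls = call_small_variants("ACGT-ACGTAC", "ACCTGAC--AC")
```

In practice the aligned strings would come from the nucmer output (for example via `show-aligns`), and coordinates would need to be offset by the start of each alignment block.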

To generate a high-quality SV dataset, we manually checked all inversions and translocations with the following steps: (1) clip 300 bp of upstream and downstream flanking sequence at each break point between the two haplotypes, BLAST these against the local PacBio reads with thresholds of identity >96% and aligned length >550 bp, and require the SV region where the maternal and paternal sequences align to have high similarity (>90%); (2) if (1) fails, check the 10X linked-read count across a 5-kb flanking region; (3) if any break point is not supported by 10X linked reads, check the Hi-C heat map of this region; if the heat map shows an inversion or translocation pattern, or the situation is ambiguous, remove the SV.
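The read-support test in step (1) amounts to a filter on alignment hits. Here is a minimal sketch, assuming the BLAST results were written in the standard 12-column tabular format (`-outfmt 6`), with the break-point flanking sequence as query and the PacBio read as subject. The source does not state the output format or the tool invocation, and the function name and example rows are hypothetical.

```python
from typing import Iterable, List

MIN_IDENTITY = 96.0    # percent identity threshold from the text (>96%)
MIN_ALIGNED_LEN = 550  # aligned length threshold from the text (>550 bp)


def supporting_reads(blast_lines: Iterable[str],
                     min_identity: float = MIN_IDENTITY,
                     min_length: int = MIN_ALIGNED_LEN) -> List[str]:
    """Return IDs of reads with at least one hit passing both thresholds.

    Assumes tab-separated columns in the order
    qseqid, sseqid, pident, length, ... (BLAST tabular output).
    """
    reads: List[str] = []
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        read_id = fields[1]
        pident, aln_len = float(fields[2]), int(fields[3])
        if pident > min_identity and aln_len > min_length and read_id not in reads:
            reads.append(read_id)
    return reads


# Hypothetical hits for one 600-bp break-point query (300 bp on each side)
example_hits = [
    "bp1\tread_a\t98.50\t600\t9\t0\t1\t600\t1200\t1799\t0.0\t1050",
    "bp1\tread_b\t95.00\t600\t30\t0\t1\t600\t500\t1099\t0.0\t900",   # identity too low
    "bp1\tread_c\t99.00\t400\t4\t0\t1\t400\t100\t499\t0.0\t700",     # alignment too short
]
supported = supporting_reads(example_hits)
```

A break point with an empty result would fall through to the 10X linked-read and Hi-C checks described in steps (2) and (3).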

To evaluate the accuracy of SV detection, we searched the binned PacBio reads around the break points of both the maternal and paternal assemblies for all indels on chromosome 1. We considered an indel accurate if it showed one of the following three features: (1) at least one PacBio long read from each haplotype that spans the entire indel region and carries the variation found in that haplotype; (2) overlapping PacBio reads that span the two break points; or (3) a PacBio read alignment validated manually in the Integrative Genomics Viewer (IGV)⁷⁴. We found that 95.7% of indels are correct when considering only the break-point location, whereas 74.2% are accurate when considering both boundary and location.
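Feature (1) can be expressed as a simple interval test. The sketch below is an illustration under stated assumptions rather than the authors' procedure: read alignments are given as hypothetical half-open (start, end) coordinates on a haplotype assembly, and the `margin` parameter (extra bases required on each side of the indel) is not from the source. It checks only that a read spans the region, not that the read carries the variant.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one haplotype assembly


def has_spanning_read(read_alignments: Iterable[Interval],
                      indel: Interval,
                      margin: int = 0) -> bool:
    """True if any single read covers the whole indel plus `margin` bp per side."""
    indel_start, indel_end = indel
    return any(start <= indel_start - margin and end >= indel_end + margin
               for start, end in read_alignments)


def spanned_in_both_haplotypes(maternal_reads: Iterable[Interval],
                               paternal_reads: Iterable[Interval],
                               maternal_indel: Interval,
                               paternal_indel: Interval,
                               margin: int = 0) -> bool:
    """Require a spanning read in each haplotype, mirroring feature (1)."""
    return (has_spanning_read(maternal_reads, maternal_indel, margin)
            and has_spanning_read(paternal_reads, paternal_indel, margin))


# Hypothetical coordinates
one_read_spans = has_spanning_read([(100, 900)], (300, 500))
split_reads_only = has_spanning_read([(100, 400), (450, 900)], (300, 500))
```

The second call returns False because neither read alone covers the region; such a case would instead be assessed under feature (2), overlapping reads that span the two break points.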

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

1 Q&A

Thank you very much for providing the protocol. I was wondering how to BLAST against local PacBio reads. Which software or script was used?

0 Answer 2 Views May 20, 2023