Detection of SNPs, indels and SVs using whole-haplotype genome alignment

Chentao Yang
Yang Zhou
Stephanie Marcus
Giulio Formenti
Lucie A. Bergeron
Zhenzhen Song
Xupeng Bi
Juraj Bergman
Marjolaine Marie C. Rousselle
Chengran Zhou
Long Zhou
Yuan Deng
Miaoquan Fang
Duo Xie
Yuanzhen Zhu
Shangjin Tan
Jacquelyn Mountcastle
Bettina Haase
Jennifer Balacco
Jonathan Wood
William Chow
Arang Rhie
Martin Pippel
Margaret M. Fabiszak
Sergey Koren
Olivier Fedrigo
Winrich A. Freiwald
Kerstin Howe
Huanming Yang
Adam M. Phillippy
Mikkel Heide Schierup
Erich D. Jarvis
Guojie Zhang

To call heterozygous sites between the two haploid sequences, independent of the GenomeScope calculation, we first aligned the two haplotypes with MUMmer (v.3.23) using the parameters ‘nucmer -maxmatch -l 100 -c 500’. Because our assemblies span most repetitive sequences, repeat masking was not necessary before the MUMmer alignment. A series of custom scripts (https://github.com/comery/marmoset) identified and sorted the SNPs and indels in the alignments. We used svmu (v.0.4-alpha) [71], Assemblytics (v.1.2) [72] and SyRI (v.1.0) [73] to detect SVs from the MUMmer alignment. After several test rounds, we found that svmu reported large indels more accurately, Assemblytics detected CNVs well, particularly tandem repeats, and SyRI detected the other SV types well. We therefore ran all three tools and combined their results into a confident SV set. We used default parameters for svmu and Assemblytics, and the recommended nucmer alignment settings for SyRI (https://schneebergerlab.github.io/syri/).
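The idea of reading SNPs and indels off a pairwise alignment can be illustrated with a minimal Python sketch. This is not the authors' custom script from the repository above: the function name `call_variants`, the tuple layout, and the input format (two equal-length gapped strings with `-` for gaps, with no column gapped in both sequences) are assumptions made for illustration only.

```python
from typing import List, Tuple

# (kind, 0-based position in the ungapped first sequence, first allele, second allele)
Variant = Tuple[str, int, str, str]


def call_variants(ref_aln: str, alt_aln: str) -> List[Variant]:
    """Scan one gapped pairwise alignment and report SNPs and indels.

    Hypothetical helper: both inputs are aligned strings of equal length
    that use '-' for gaps. A run of gaps is reported as a single indel.
    """
    if len(ref_aln) != len(alt_aln):
        raise ValueError("aligned sequences must have equal length")

    variants: List[Variant] = []
    ref_pos = 0  # offset in the ungapped first sequence
    i, n = 0, len(ref_aln)
    while i < n:
        r, a = ref_aln[i], alt_aln[i]
        if r == "-" or a == "-":
            gap_in_ref = r == "-"
            gapped = ref_aln if gap_in_ref else alt_aln
            j = i
            while j < n and gapped[j] == "-":
                j += 1  # extend over the whole gap run
            if gap_in_ref:
                # Extra bases in the second sequence: insertion
                variants.append(("INS", ref_pos, "", alt_aln[i:j]))
            else:
                # Bases missing from the second sequence: deletion
                variants.append(("DEL", ref_pos, ref_aln[i:j], ""))
                ref_pos += j - i
            i = j
            continue
        if r.upper() != a.upper():
            variants.append(("SNP", ref_pos, r, a))
        ref_pos += 1
        i += 1
    return variants
```

For example, `call_variants("ACGT-ACGTA", "ACCTGAC--A")` reports one SNP, one 1-bp insertion and one 2-bp deletion. A real pipeline would additionally have to parse the nucmer output and resolve overlapping alignment blocks, which this sketch leaves out.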

To generate a high-quality SV dataset, we manually checked all inversions and translocations in the following steps: (1) clip 300 bp of upstream and downstream flanking sequence at each break point in the two haplotypes, BLAST these sequences against the local PacBio reads with thresholds of identity >96% and aligned length >550 bp, and require the SV region in which the maternal and paternal sequences align to have high similarity (>90%); (2) if step (1) fails, check the 10X linked-read count within a 5-kb flanking region; (3) if any break point is not supported by 10X linked reads, check the Hi-C heat map of the region, and remove the SV if the heat map shows an inversion or translocation pattern or is ambiguous.
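The three-step check above is a decision cascade, which can be sketched in Python as follows. The checks were performed manually in the study, so this is only an illustration of the logic. The class and function names, the requirement that every break point has a supporting PacBio hit, the boolean `linked_read_supported` flag (the text gives no linked-read count threshold) and the Hi-C labels `"clean"`, `"rearranged"` and `"ambiguous"` are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BlastHit:
    identity: float       # percent identity of the hit
    aligned_length: int   # aligned length in bp


@dataclass
class Breakpoint:
    pacbio_hits: List[BlastHit]   # hits of the clipped flanks against PacBio reads
    linked_read_supported: bool   # caller-defined 10X support in the 5-kb flank


def pacbio_supported(bp: Breakpoint, min_identity: float = 96.0,
                     min_length: int = 550) -> bool:
    """Step 1 thresholds from the text: identity >96%, aligned length >550 bp."""
    return any(h.identity > min_identity and h.aligned_length > min_length
               for h in bp.pacbio_hits)


def validate_sv(breakpoints: List[Breakpoint], haplotype_similarity: float,
                hic_pattern: str) -> Tuple[bool, str]:
    """Return (keep, evidence) for one inversion or translocation."""
    if not breakpoints:
        raise ValueError("an SV needs at least one break point")
    # Step 1: PacBio support at every break point plus >90% haplotype similarity
    if (all(pacbio_supported(bp) for bp in breakpoints)
            and haplotype_similarity > 90.0):
        return True, "pacbio"
    # Step 2: fall back to 10X linked-read support
    if all(bp.linked_read_supported for bp in breakpoints):
        return True, "linked_reads"
    # Step 3: Hi-C heat map; a rearranged or ambiguous pattern removes the SV
    if hic_pattern == "clean":
        return True, "hic"
    return False, "hic_" + hic_pattern
```

Returning the evidence label alongside the decision keeps a record of which data type supported each retained SV.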

To evaluate the accuracy of SV detection, we searched the binned PacBio reads around the break points in both the maternal and paternal assemblies for all indels on chromosome 1. We considered an indel accurate if it showed one of the following three features: (1) at least one PacBio long read from each haplotype spans the entire indel region and carries the variant found in that haplotype; (2) overlapping PacBio reads span the two break points; or (3) the PacBio read alignment was manually validated in the Integrative Genomics Viewer (IGV) [74]. We found that 95.7% of indels were correct when considering only the location of the breakage, whereas 74.2% were accurate when considering both boundary and location.
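Features (1) and (2) amount to interval checks on read alignments, as in the hedged Python sketch below. It assumes each read alignment is reduced to a half-open `(start, end)` interval on the relevant haplotype assembly, and that feature (2) is applied to both haplotypes, which the text does not state. It does not check that a read actually carries the variant. All names are hypothetical, and indels that fail both tests are simply flagged for the manual IGV inspection of feature (3).

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, in assembly coordinates


def spanned_by_single_read(reads: List[Interval], start: int, end: int) -> bool:
    """Feature (1): one read covers the whole indel region."""
    return any(s <= start and e >= end for s, e in reads)


def spanned_by_overlapping_reads(reads: List[Interval], start: int, end: int) -> bool:
    """Feature (2): a chain of overlapping reads connects both break points."""
    reach = None  # right-most position reached by the current chain
    for s, e in sorted(reads):
        if reach is None:
            if s > start:
                return False      # no read covers the first break point
            if e > start:
                reach = e         # chain starts at a read covering `start`
        elif s < reach:
            reach = max(reach, e)  # read overlaps the chain and extends it
        else:
            return False          # coverage gap before the second break point
        if reach is not None and reach >= end:
            return True
    return False


def indel_support(maternal_reads: List[Interval], paternal_reads: List[Interval],
                  maternal_region: Interval, paternal_region: Interval) -> str:
    """Classify the read evidence for one indel across both haplotypes."""
    pairs = [(maternal_reads, maternal_region), (paternal_reads, paternal_region)]
    if all(spanned_by_single_read(r, *reg) for r, reg in pairs):
        return "single_read"
    if all(spanned_by_overlapping_reads(r, *reg) for r, reg in pairs):
        return "overlapping_reads"
    return "manual_review"
```

Sorting the reads by start position lets the chain test stop at the first coverage gap, since no later read can bridge it.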
