Estimation of sequencing error and polishing error

Chentao Yang, Yang Zhou, Stephanie Marcus, Giulio Formenti, Lucie A. Bergeron, Zhenzhen Song, Xupeng Bi, Juraj Bergman, Marjolaine Marie C. Rousselle, Chengran Zhou, Long Zhou, Yuan Deng, Miaoquan Fang, Duo Xie, Yuanzhen Zhu, Shangjin Tan, Jacquelyn Mountcastle, Bettina Haase, Jennifer Balacco, Jonathan Wood, William Chow, Arang Rhie, Martin Pippel, Margaret M. Fabiszak, Sergey Koren, Olivier Fedrigo, Winrich A. Freiwald, Kerstin Howe, Huanming Yang, Adam M. Phillippy, Mikkel Heide Schierup, Erich D. Jarvis, Guojie Zhang

To estimate sequencing and polishing errors, we first established a high-confidence SNP set as a reference standard. We used three independent approaches to detect SNPs between the two haplotypes: (1) retrieving heterozygous sites from the MUMmer alignment between the maternal and paternal haplotypes, excluding the sex chromosomes (setA, containing 3.48 million SNVs); (2) the GATK pipeline, based on mapping of 10X linked-reads from the F1 offspring (setB); and (3) SAMtools (v.1.8) mpileup followed by BCFtools, also based on 10X linked-read mapping (setC). A raw SNP dataset was then generated by a two-step procedure: first taking the intersection of setB and setC to generate Set1 (3.72 million SNVs), then taking the union of setA and Set1 to obtain Set2 (3.77 million SNVs). From these two sets we selected a high-quality Set3 of 3.58 million SNPs (Supplementary Fig. 10) by filtering out: (1) sites with a 10X linked-read depth lower than 10; (2) sites that do not align to the two haplotype assemblies; and (3) sites at which we could not call a typical haplotype because each nucleotide accounted for much less than 50% of the read depth (π > 0.4 and the third-highest depth > 5), in which π is calculated as:

$$\pi = \frac{2 \times (AT + AC + AG + TC + TG + CG)}{\mathrm{Totaldepth} \times (\mathrm{Totaldepth} - 1)}$$

and A, T, C and G represent the sequencing depths of bases A, T, C and G at each site, so that each two-letter term (for example, AT) is the product of the two corresponding depths. For example, a distribution of ‘A:20; T:20; C:14; G:0’ gives π ≈ 0.67 with a third-highest depth of 14, which indicates a complex condition. We also collected mapping information from the raw and the corrected PacBio reads. This allowed us to establish an evidence chain of how the bases in each haplotype changed during assembly and polishing, and thereby to classify the different error types. We classified 195,751 sequencing error sites and 180,712 polishing error sites. The sequencing and polishing error rates were estimated to be 3.41 × 10⁻⁵ and 3.66 × 10⁻⁵, respectively. We further validated the variants with PCR experiments (Supplementary Note).
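The set construction and the site filter described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' pipeline: the function names, the representation of a SNP as a `(chrom, pos, ref, alt)` tuple, and the use of the summed base depths as the 10X linked-read depth are all assumptions made for the example. Only the thresholds (depth 10, π > 0.4, third-highest depth > 5) and the π formula come from the text.

```python
from itertools import combinations

BASES = "ATCG"


def combine_call_sets(set_a, set_b, set_c):
    """Two-step procedure: Set1 = setB & setC, then Set2 = setA | Set1.

    Each input is assumed to be a Python set of hashable SNP records,
    e.g. (chrom, pos, ref, alt) tuples (a hypothetical representation).
    """
    set1 = set_b & set_c
    set2 = set_a | set1
    return set1, set2


def site_pi(depths):
    """Per-site pi: 2 * (sum of pairwise depth products) / (n * (n - 1))."""
    counts = [depths.get(base, 0) for base in BASES]
    total = sum(counts)
    if total < 2:
        return 0.0  # pi is undefined for fewer than two reads; treat as 0 here
    pair_sum = sum(x * y for x, y in combinations(counts, 2))
    return 2 * pair_sum / (total * (total - 1))


def is_complex_site(depths, pi_cutoff=0.4, third_depth_cutoff=5):
    """Flag a site when pi > 0.4 and the third-highest base depth is > 5."""
    counts = sorted((depths.get(base, 0) for base in BASES), reverse=True)
    return site_pi(depths) > pi_cutoff and counts[2] > third_depth_cutoff


def keep_site(depths, aligned_to_both, min_depth=10):
    """Apply the three filters; total depth is assumed to be the sum of base depths."""
    total = sum(depths.get(base, 0) for base in BASES)
    return total >= min_depth and aligned_to_both and not is_complex_site(depths)


# The example distribution from the text, and a clean two-allele site for contrast.
example = {"A": 20, "T": 20, "C": 14, "G": 0}
clean_het = {"A": 25, "T": 25, "C": 0, "G": 0}
print(round(site_pi(example), 3), is_complex_site(example))
print(round(site_pi(clean_het), 3), is_complex_site(clean_het))
```

For the example distribution, π is 1920/2862 ≈ 0.67 and the third-highest depth is 14, so the site is flagged as complex. A clean heterozygous site such as ‘A:25; T:25’ also has π above 0.4 (≈ 0.51), which is why the third-highest-depth condition is needed to separate it from a genuinely complex site.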
