Estimation of sequencing error and polishing error

Chentao Yang; Yang Zhou; Stephanie Marcus; Giulio Formenti; Lucie A. Bergeron; Zhenzhen Song; Xupeng Bi; Juraj Bergman; Marjolaine Marie C. Rousselle; Chengran Zhou; Long Zhou; Yuan Deng; Miaoquan Fang; Duo Xie; Yuanzhen Zhu; Shangjin Tan; Jacquelyn Mountcastle; Bettina Haase; Jennifer Balacco; Jonathan Wood; William Chow; Arang Rhie; Martin Pippel; Margaret M. Fabiszak; Sergey Koren; Olivier Fedrigo; Winrich A. Freiwald; Kerstin Howe; Huanming Yang; Adam M. Phillippy; Mikkel Heide Schierup; Erich D. Jarvis; Guojie Zhang




This method is extracted from research article: Nature, Apr 2021

Evolutionary and biomedical insights from a marmoset diploid genome assembly

DOI: 10.1038/s41586-021-03535-x


To calculate sequencing errors and polishing errors, we first established a confident SNP set as a reference standard. We used three independent approaches to detect SNPs between the two haplotypes: (1) heterozygous sites retrieved from the MUMmer alignment between the maternal and paternal haplotypes, excluding the sex chromosomes (setA, containing 3.48 million SNVs); (2) the GATK pipeline applied to the mapping of 10X linked-reads from the F₁ offspring (setB); and (3) SAMtools (v.1.8) mpileup followed by BCFtools, also based on the 10X linked-read mapping (setC). A raw SNP dataset was then generated in two steps: first taking the intersection of setB and setC to generate Set1 (3.72 million SNVs), then taking the union of setA and Set1 to obtain Set2 (3.77 million SNVs). From these two sets we selected a high-quality set of 3.58 million SNPs (Set3; Supplementary Fig. 10) by applying the following filters: (1) remove sites with a 10X linked-read depth lower than 10; (2) remove sites that do not align to the two haplotype assemblies; (3) remove sites at which we could not call a typical haplotype because the nucleotide distribution fell well below 50% (π > 0.4 and the third-highest depth > 5). Here π is calculated as:

$\pi = 2 \times (A \cdot T + A \cdot C + A \cdot G + T \cdot C + T \cdot G + C \cdot G) / (\mathrm{Totaldepth} \times (\mathrm{Totaldepth} - 1))$

where A, T, C and G represent the sequencing depths of bases A, T, C and G at each site. For example, a distribution of ‘A:20; T:20; C:14; G:0’ indicates a complex condition (π ≈ 0.67 and a third-highest depth of 14). We also collected the mapping information from raw PacBio reads and corrected PacBio reads. This allowed us to establish an evidence chain of how the bases in each haplotype changed during assembly and polishing, which in turn allowed us to classify the different error types. We classified 195,751 sequencing error sites and 180,712 polishing error sites. The sequencing and polishing error rates were estimated to be 3.41 × 10⁻⁵ and 3.66 × 10⁻⁵, respectively. We further validated the variants with PCR experiments (Supplementary Note).
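The set construction and the π-based filter described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' pipeline: the function names (`build_raw_snp_sets`, `site_pi`, `is_complex_site`) and the representation of each SNP as a `(chromosome, position)` tuple are hypothetical, and matching calls by position alone ignores allele identity.

```python
from itertools import combinations

BASES = "ATCG"


def build_raw_snp_sets(set_a, set_b, set_c):
    """Combine the three call sets as described in the text.

    Each input is a set of (chromosome, position) tuples (assumed format).
    Returns (Set1, Set2): Set1 = setB ∩ setC, Set2 = setA ∪ Set1.
    """
    set1 = set_b & set_c
    set2 = set_a | set1
    return set1, set2


def site_pi(depths):
    """Per-site pi: 2 * sum of pairwise depth products / (N * (N - 1))."""
    counts = [depths.get(base, 0) for base in BASES]
    total = sum(counts)
    if total < 2:
        return 0.0  # pi is undefined for fewer than two reads; treat as 0
    pair_sum = sum(x * y for x, y in combinations(counts, 2))
    return 2 * pair_sum / (total * (total - 1))


def is_complex_site(depths, pi_cutoff=0.4, third_depth_cutoff=5):
    """True if pi > 0.4 and the third-highest base depth > 5 (filter 3)."""
    counts = sorted((depths.get(base, 0) for base in BASES), reverse=True)
    return site_pi(depths) > pi_cutoff and counts[2] > third_depth_cutoff


if __name__ == "__main__":
    # Toy call sets (made-up positions, for illustration only).
    set_a = {("chr1", 100), ("chr1", 200)}
    set_b = {("chr1", 200), ("chr1", 300), ("chr1", 400)}
    set_c = {("chr1", 300), ("chr1", 400), ("chr1", 500)}
    set1, set2 = build_raw_snp_sets(set_a, set_b, set_c)
    print(sorted(set1))  # the two positions called by both setB and setC
    print(len(set2))     # 4

    # The example from the text: three well-supported bases at one site.
    example = {"A": 20, "T": 20, "C": 14, "G": 0}
    print(round(site_pi(example), 3))  # 0.671
    print(is_complex_site(example))    # True

    # A clean two-allele heterozygous site also has high pi, but no third base.
    print(is_complex_site({"A": 25, "T": 0, "C": 0, "G": 25}))  # False
```

The last case shows why the filter pairs the two conditions: a balanced heterozygous site already has π near 0.5, so it is the depth of the third-highest base that separates a complex site from an ordinary heterozygous one.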

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
