Estimating sequencing error rates

Kelley Paskov; Jae-Yoon Jung; Brianna Chrisman; Nate T. Stockham; Peter Washington; Maya Varma; Min Woo Sun; Dennis P. Wall


This method is extracted from research article: BioData Min, Apr 2021

Estimating sequencing error rates using families

DOI: 10.1186/s13040-021-00259-6


Our method estimates nine different error rates for each individual, as shown in Fig. 11. Family data allows us to detect some sequencing errors because they produce non-Mendelian observations in the family, as shown in Fig. 1. By modelling the frequency of these non-Mendelian observations, we can estimate per-individual error distributions and estimate the total number of sequencing errors in the dataset.

We estimate detailed error distributions for each genotype in each individual. The ./. observation represents missing data.

Let $C_{g}^{(i)}$ be a random variable representing the observed variant call for individual $i$ at a biallelic site with ground-truth genotype $g \in \{0/0, 0/1, 1/1\}$. Sequencing errors can cause $C_{g}^{(i)} \neq g$, so our goal is to estimate the distribution of $C_{g}^{(i)}$ within a genomic dataset. Specifically, we would like to estimate $P(C_{g}^{(i)} = c)$ with $c \in \{0/0, 0/1, 1/1, ./.\}$ for all $g$, $c$, and $i$. The ./. observation represents a site where the variant caller was unable to assign a genotype to the individual. By modelling these missing sites, we are able to estimate the rate of missing data for each individual while we estimate the other error rates. Here we make three main assumptions in order to simplify modelling:

We assume sequencing errors are rare, so $P (C_{g}^{(i)} \neq g)$ is very small.

We assume that all observations of Mendelian errors in a family are the result of sequencing error. This may not be true in the case of de novo variants or variants falling within inherited deletions, duplications, or other structural variants. However, we expect this assumption to hold over the majority of the genome.

We assume each sequencing error occurs independently in different family members, so the chance of observing multiple sequencing errors at the same site within the same family is vanishingly small. This may not be true in repetitive or otherwise hard-to-sequence regions, but we expect these special cases to be infrequent.
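The per-individual error rates can be enumerated directly from the definitions above. The following is a minimal Python sketch, assuming the nine rates are the pairs of a ground-truth genotype $g$ and an observed call $c \neq g$; the names `GENOTYPES`, `CALLS`, and `error_pairs` are illustrative and do not come from the article.

```python
# Ground-truth genotypes at a biallelic site, and the possible observed calls.
GENOTYPES = ("0/0", "0/1", "1/1")
CALLS = GENOTYPES + ("./.",)  # "./." is a missing call


def error_pairs():
    """Return every (true genotype, observed call) pair that is an error."""
    return [(g, c) for g in GENOTYPES for c in CALLS if c != g]


if __name__ == "__main__":
    for g, c in error_pairs():
        print(f"P(C_{g} = {c})")
```

Each true genotype has three erroneous outcomes (two wrong genotypes and the missing call), which gives three times three rates per individual.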

We define a family genotype as a tuple of genotypes, representing the genotypes of a mother, father, and their child(ren), respectively, at a given site. For example, (0/0, 0/1, 0/1, 0/0) is a family genotype for a family of four where the mother is homozygous reference, the father is heterozygous, the first child is heterozygous, and the second child is homozygous reference. Some family genotypes are valid, meaning they contain no missing genotypes and obey Mendelian inheritance. Let $V$ represent the set of valid family genotypes and let $W$ represent the set of invalid family genotypes. For example, (0/0, 0/1, 0/1, 0/0) is valid. However, (0/0, 0/0, 0/1, 0/0) is invalid because both parents are homozygous reference, but one of the children has a variant.
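The validity rule can be expressed as a short check. This is a minimal Python sketch, assuming family genotypes are tuples of strings ordered (mother, father, child, ...); the helper names `alleles` and `is_valid` are hypothetical.

```python
from itertools import product


def alleles(genotype):
    """Split a call such as '0/1' into its two alleles; './.' (missing) gives None."""
    if genotype == "./.":
        return None
    return tuple(sorted(genotype.split("/")))


def is_valid(family_genotype):
    """True if no call is missing and every child obeys Mendelian inheritance."""
    parsed = [alleles(g) for g in family_genotype]
    if any(p is None for p in parsed):
        return False
    mother, father, *children = parsed
    # A child must carry one allele from the mother and one from the father.
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return all(child in possible for child in children)


if __name__ == "__main__":
    print(is_valid(("0/0", "0/1", "0/1", "0/0")))
    print(is_valid(("0/0", "0/0", "0/1", "0/0")))
```

The two calls in the usage block correspond to the valid and invalid examples given in the text.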

We can represent any sequencing dataset as a set of family genotypes. Let $x_j$ represent the ground-truth number of occurrences of family genotype $j$, that is, the count we would obtain if we could sequence perfectly, without any sequencing error or missing data. We do not have access to $x_j$. Instead, we have access to $y_j$, the number of times we observe family genotype $j$ in our dataset, in the presence of sequencing error and missing data. Since we assume that all sites obey Mendelian inheritance, $x_{w} = 0$ for all invalid family genotypes $w \in W$. However, sequencing error may cause $y_w > 0$.

Let $p_{v \to w}$ represent the probability that sequencing errors cause valid family genotype $v$ to be observed as invalid family genotype $w$. We model $Y_w$, a random variable representing the number of times we observe the invalid family genotype $w$. Uppercase $Y_w$ denotes the random variable and lowercase $y_w$ denotes a realization of it (in this case, our observations). Assuming sequencing errors are rare, we can apply a generalization of Le Cam’s theorem [27] to show that the $Y_w$, as sums of multinomials, are approximately distributed as independent Poissons.
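Written out under the definitions above, the approximation would plausibly take the following form. This is a reconstruction, assuming each of the $x_v$ sites with valid family genotype $v$ is observed as $w$ with probability $p_{v \to w}$, and should not be read as a quotation from the article.

$$
Y_w \sim \text{Poisson}\Big( \sum_{v \in V} x_v \, p_{v \to w} \Big), \quad \text{independently for each } w \in W
$$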

The error of the approximation is bounded by $2 \sum_{v \in V} x_{v} \delta_{v}^{2}$, where $\delta_v$ is the probability of a sequencing error occurring at a site with family genotype $v$. Since sequencing errors are rare, we expect $\delta_v$ to be very small for all $v$, so the approximation is quite good.
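As a rough illustration of how the bound scales, take hypothetical values that are not from the article: a single valid family genotype with $x_v = 10^{6}$ sites and $\delta_v = 10^{-4}$.

$$
2 \, x_v \, \delta_v^{2} = 2 \times 10^{6} \times \left(10^{-4}\right)^{2} = 0.02
$$

In this made-up setting about $x_v \delta_v = 100$ errors are expected, yet the bound stays at 0.02, because it shrinks with the square of the error probability.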

We would like to use our Poisson approximation to develop a maximum likelihood estimate for each $P(C_{g}^{(i)} = c)$. Since we assume that the chance of multiple errors occurring at the same site within the same family is vanishingly small, $p_{v \to w} \neq 0$ only if $v$ and $w$ differ in exactly one family member. In this case, we call $v$ and $w$ neighbors. Every pair of neighboring genotypes has a corresponding $P(C_{g}^{(i)} = c)$, where $i$ is the index of the family member that has different genotypes in $v$ and $w$, $g$ is the genotype of family member $i$ in $v$, and $c$ is the genotype of family member $i$ in $w$. For example, family genotype (0/0, 0/0, 0/1, 0/0) has only three valid neighbors: (0/0, 0/1, 0/1, 0/0), (0/0, 0/0, 0/0, 0/0), and (0/1, 0/0, 0/1, 0/0). $Y_{(0/0,0/0,0/1,0/0)}$ is therefore distributed as:

We do not have access to $x_v$, the ground-truth number of occurrences of valid family genotype $v$. However, since sequencing errors are rare, we assume most valid family genotypes are observed correctly, so we can use $y_v$ as an approximation of $x_v$. Since our model is linear in the parameters of interest, Poisson regression will produce a maximum likelihood estimate of each $P(C_{g}^{(i)} = c)$.
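The fitting step can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes the model $Y_w \sim \text{Poisson}(\sum_v y_v \theta_k)$, where $\theta_k$ is the error rate linking $v$ to $w$, and it swaps in a simple EM-style multiplicative update for nonnegative identity-link Poisson regression instead of a library routine. The function name `fit_error_rates`, the row layout, and the parameter keys are all hypothetical.

```python
from collections import defaultdict


def fit_error_rates(rows, n_iter=500, init=1e-3):
    """Estimate nonnegative error rates by maximum likelihood.

    rows: list of (y_w, terms), one per invalid family genotype w, where
          terms is a list of (param_key, y_v) pairs, one per valid neighbor v.
    Model (assumed): Y_w ~ Poisson(sum of y_v * rate[param_key] over terms).
    """
    # Total count of neighboring valid sites that each parameter multiplies.
    exposure = defaultdict(float)
    for _, terms in rows:
        for key, y_v in terms:
            exposure[key] += y_v

    rates = {key: init for key in exposure}
    for _ in range(n_iter):
        weighted = defaultdict(float)
        for y_w, terms in rows:
            mu = sum(y_v * rates[key] for key, y_v in terms)  # expected count
            if mu > 0:
                for key, y_v in terms:
                    weighted[key] += y_v * y_w / mu
        # Multiplicative EM update; parameters with no exposure are set to 0.
        rates = {
            key: (rates[key] * weighted[key] / exposure[key]
                  if exposure[key] > 0 else 0.0)
            for key in rates
        }
    return rates


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    rows = [
        (1.0, [(("father", "0/1", "0/0"), 1000)]),
        (4.0, [(("child1", "0/0", "0/1"), 2000)]),
        (5.0, [(("father", "0/1", "0/0"), 500),
               (("mother", "0/1", "0/0"), 1500)]),
    ]
    for key, rate in fit_error_rates(rows).items():
        print(key, rate)
```

The multiplicative update keeps every rate nonnegative without an explicit constraint, which is the reason for choosing it here. It does not cap rates at 1, so a real analysis would need to check the estimates for plausibility.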

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
