Our method estimates nine different error rates for each individual, as shown in Fig. 11. Family data allows us to detect some sequencing errors because they produce non-Mendelian observations in the family, as shown in Fig. 1. By modelling the frequency of these non-Mendelian observations, we can estimate per-individual error distributions and estimate the total number of sequencing errors in the dataset.
We estimate detailed error distributions for each genotype in each individual. The./. observation represents missing data
Let be a random variable representing the observed variant call for individual i at a biallelic site with ground-truth genotype g∈{0/0, 0/1, 1/1}. Sequencing errors can cause , so our goal is to estimate the distribution of within a genomic dataset. Specifically, we would like to estimate with c∈{0/0, 0/1, 1/1,./.} for all g, c, and i. The./. observation represents a site where the variant caller was unable to assign a genotype to the individual. By modeling these missing sites, we are able to estimate the rate of missing data for each individual while we estimate the other error rates. Here we make three main assumptions in order to simplify modelling:
We assume sequencing errors are rare, so is very small.
We assume that all observations of Mendelian errors in a family are the result of sequencing error. This may not be true in the case of de novo variants or variants falling within inherited deletions, duplications, or other structural variants. However, we expect this assumption to hold over the majority of the genome.
We assume each sequencing error occurs independently in different family members, so the chance of observing multiple sequencing errors at the same site within the same family is vanishingly small. This may not be true in repetitive or otherwise hard-to-sequence regions, but we expect these special cases to be infrequent.
We define a family genotype as a tuple of genotypes, representing the genotypes of a mother, father, and their child(ren), respectively, at a given site. For example (0/0, 0/1, 0/1, 0/0) is a family genotype for a family of four where the mother is homozygous reference, father heterozygous, first child heterozygous, and second child homozygous reference. Some family genotypes are valid, meaning they contain no missing genotypes and obey Mendelian inheritance. Let represent the set of valid family genotypes and let represent the set of invalid family genotypes. For example, (0/0, 0/1, 0/1, 0/0) is valid. However, (0/0, 0/0, 0/1, 0/0) is invalid because both parents are homozygous reference, but one of the children has a variant.
We can represent any sequencing dataset as a set of family genotypes. Let xj represent the ground-truth number of occurrences of family genotype j, if we could sequence perfectly without any sequencing error or missing data. We do not have access to xj. Instead, we have access to yj, the number of times we observe family genotype j in our dataset, in the presence of sequencing error and missing data. Since we assume that all sites obey Mendelian inheritance, for all invalid family genotypes . However sequencing error may cause yw>0.
Let pv→w represent the probability that sequencing errors cause valid family genotype v to be observed as invalid family genotype w. We model Yw, a random variable representing the number of times we observe the invalid family genotype w, using Yw to denote a random variable and lowercase yw to denote a realization of that random variable (in this case, our observations). Assuming sequencing errors are rare, we can apply a generalization of Le Cam’s theorem [27] to show that the Yws, as sums of multinomials, are approximately distributed as independent Poissons.
The error of the approximation is bounded by where δv is the probability of a sequencing error occurring at a site with family genotype v. Since sequencing errors are rare, we expect δv to be very small for all v, so the approximation is quite good.
We would like to use our Poisson approximation to develop a maximum likelihood estimate for each . Since we assume that the chance of multiple errors occurring at the same site within the same family is vanishingly small, pv→w≠0 only if v and w differ for only a single family member. In this case, we call v and w neighbors. Every pair of neighboring genotypes has a corresponding where i is the index of the family member that has different genotypes in v and w, g is the genotype of family member i in v, and c is the genotype of family member i in w. For example, family genotype (0/0,0/0,0/1,0/0) has only three valid neighbors: (0/0,0/1,0/1,0/0),(0/0,0/0,0/0,0/0), and (0/1,0/0,0/1,0/0). Y(0/0,0/0,0/1,0/0) is therefore distributed as:
We do not have access to xv, the ground truth number of occurrences of valid family genotype v. However, since sequencing errors are rare, we assume most valid family genotypes are observed correctly, so we can use yv as an approximation of xv. Since our model is linear in the parameters of interest, Poisson regression will produce a maximum likelihood estimate of each .
Do you have any questions about this protocol?
Post your question to gather feedback from the community. We will also invite the authors of this article to respond.