We assume that N reads are obtained from both chromosomes. For a haplotype of length l, an N × l fragment matrix R is constructed whose rows correspond to the reads and whose columns correspond to the heterozygous SNP sites [19, 20]. Assuming bi-allelic SNPs, each base of a read at a SNP site is coded as 1 (reference allele) or −1 (alternative allele), and SNP sites not covered by a read are coded as 0.
As an example of an error-free case, consider the first exon of HLA-A, a gene on chromosome 6 with NCBI reference sequence number NG_029217.2. Its first 40 bases are presented in Fig 1a. It contains five bi-allelic SNP sites (refSNP): C/T (rs753601428), C/G (rs529070997), G/T (rs41560714), A/C (rs551138783) and A/G (rs778615037). The procedure of constructing the fragment matrix is depicted in Fig 1d. In this example, the exact haplotypes that should be reconstructed by the haplotype assembly algorithms are {CGTAG} and {TCGCA}.
Fig 1. HLA-A is located on chromosome 6, with NCBI reference sequence number NG_029217.2. It contains five bi-allelic SNP sites (refSNP): C/T (rs753601428), C/G (rs529070997), G/T (rs41560714), A/C (rs551138783) and A/G (rs778615037). (a) An example of homologous chromosomes in which the SNP sites are indicated in bold, (b) an example of aligned reads, (c) the fragments after removing non-informative reads and non-SNP bases, and (d) the constructed fragment matrix.
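The encoding rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the dictionary representation of a fragment, and the four example fragments are hypothetical, and the first allele listed for each SNP (C, C, G, A, A) is assumed to be the reference allele, which the text does not state explicitly.

```python
import numpy as np

# The five SNP sites of the HLA-A example as (reference, alternative) pairs.
# Assumption: the first allele listed in the text is the reference allele.
SNPS = [("C", "T"), ("C", "G"), ("G", "T"), ("A", "C"), ("A", "G")]


def encode_fragment(fragment, snps=SNPS):
    """Encode one fragment, given as {snp_index: base}, as a row of R."""
    row = np.zeros(len(snps), dtype=int)  # uncovered sites stay 0
    for site, base in fragment.items():
        ref, alt = snps[site]
        if base == ref:
            row[site] = 1
        elif base == alt:
            row[site] = -1
        # Any other base is ignored, so the entry remains 0.
    return row


def build_fragment_matrix(fragments, snps=SNPS):
    """Stack the encoded fragments into the N x l fragment matrix R."""
    return np.vstack([encode_fragment(f, snps) for f in fragments])


# Hypothetical fragments drawn from the haplotypes CGTAG and TCGCA.
fragments = [
    {0: "C", 1: "G", 2: "T"},  # covers sites 0-2 of CGTAG
    {1: "C", 2: "G", 3: "C"},  # covers sites 1-3 of TCGCA
    {2: "T", 3: "A", 4: "G"},  # covers sites 2-4 of CGTAG
    {3: "C", 4: "A"},          # covers sites 3-4 of TCGCA
]
R = build_fragment_matrix(fragments)
print(R)
```

Under this assumed coding, haplotype CGTAG maps to (1, −1, −1, 1, −1) and TCGCA to its negation, so every row of R agrees, on its covered sites, with one of two complementary vectors.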
The fragment matrix R can be modeled using a matrix completion approach [19, 20]. In the error-free case, R is a partially observed matrix modeled as
$$R = P_\Omega(M),$$
where M is the completed version of matrix R (see section B of S1 Appendix for more details). PΩ is the observation operator defined as
$$[P_\Omega(M)]_{ij} = \begin{cases} M_{ij}, & (i,j) \in \Omega, \\ 0, & \text{otherwise,} \end{cases}$$
in which Ω is the set of indices of known entries. To generalize the model to the more realistic case that allows erroneous entries, we use an additive measurement error model inspired by [11, 19, 20]:
$$R = P_\Omega(M) + E.$$
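The observation operator PΩ can be illustrated as a simple masking step. The sketch below is an assumption-laden toy example rather than the authors' code: the function name `observe`, the Boolean-mask representation of Ω, and the construction of M from a haplotype vector `h` and a per-read membership vector are all hypothetical choices made for illustration.

```python
import numpy as np


def observe(M, omega_mask):
    """Apply P_Omega: keep M[i, j] where the mask is True, 0 elsewhere."""
    return np.where(omega_mask, M, 0)


# Toy completed matrix M. Assumption: each read originates from one of two
# complementary haplotypes h and -h, so each row of M is +h or -h.
h = np.array([1, -1, -1, 1, -1])       # e.g. CGTAG with C, C, G, A, A as reference
membership = np.array([1, -1, 1, -1])  # which haplotype each read comes from
M = np.outer(membership, h)

# Omega as a Boolean mask: True where a read covers a SNP site.
omega_mask = np.array(
    [
        [1, 1, 1, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1],
    ],
    dtype=bool,
)

# Error-free fragment matrix: R = P_Omega(M).
R_clean = observe(M, omega_mask)
print(R_clean)
```

Representing Ω as a mask of the same shape as M keeps the operator a single vectorized call; an explicit list of index pairs would work equally well for sparse data.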
To define the error matrix E, we first clarify what we mean by an error. A substitution error is the conversion of a DNA base to one of the other three possible bases during sequencing. As mentioned earlier, only two bases (the reference and alternative alleles) are permitted at each SNP site during fragment matrix construction, and other bases are ignored; as a result, a substitution to an ignored base does not affect the entries of the fragment matrix. Accordingly, we introduce the term bi-allelic substitution, or simply bi-substitution, to distinguish it from the generally defined substitution. A bi-substitution error occurs when a reference allele is converted to the alternative allele or vice versa. Consequently, an error in an entry of PΩ(M) reduces to a change from −1 to 1 or vice versa. This can be formulated as the addition of 2 (for a −1 entry) or −2 (for a 1 entry) to each erroneous entry of PΩ(M), and these additions are recorded in the error matrix E. We assume that each non-zero entry of R is erroneous with probability pe, the bi-substitution error probability, independently of the other entries. Because only one of the three possible substitutions at a SNP site yields the other permitted allele, pe equals one third of the substitution error probability of the sequencing device, ps (that is, pe = ps/3), provided the three substitutions are equally likely.
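The bi-substitution error model can be simulated as follows. This is a hedged sketch under stated assumptions: the function name `sample_error_matrix`, the example matrix, and the value ps = 0.3 are hypothetical and chosen only so that flips are likely to appear in a small example; the relation pe = ps/3 and the ±2 entries of E follow the description above.

```python
import numpy as np


def sample_error_matrix(R_clean, p_s, rng):
    """Draw an error matrix E for an error-free fragment matrix R_clean.

    Each non-zero (observed) entry is flipped independently with
    probability p_e = p_s / 3. Flipping x in {-1, 1} to -x is the same
    as adding -2x, i.e. +2 for a -1 entry and -2 for a +1 entry.
    """
    p_e = p_s / 3.0
    observed = R_clean != 0
    flips = observed & (rng.random(R_clean.shape) < p_e)
    return np.where(flips, -2 * R_clean, 0)


rng = np.random.default_rng(0)  # fixed seed for reproducibility

# Hypothetical error-free fragment matrix (0 marks uncovered sites).
R_clean = np.array(
    [
        [1, -1, -1, 0, 0],
        [0, 1, 1, -1, 0],
        [0, 0, -1, 1, -1],
        [0, 0, 0, -1, 1],
    ]
)

E = sample_error_matrix(R_clean, p_s=0.3, rng=rng)
R_noisy = R_clean + E  # additive measurement error model
print(E)
print(R_noisy)
```

Because E is non-zero only on observed entries and always equals −2 times the original value there, the noisy matrix keeps its entries in {−1, 0, 1} and uncovered sites remain 0.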