Fragment matrix model

Sina Majidian; Mohammad Hossein Kahaei; Dick de Ridder


This method is extracted from research article: PLoS One, Jun 2020

Minimum error correction-based haplotype assembly: Considerations for long read data

DOI: 10.1371/journal.pone.0234470


We assume that there are N reads obtained from both chromosomes. For a haplotype of length l, an N × l fragment matrix R is constructed whose rows embed the reads and whose columns correspond to the heterozygous SNP sites [19, 20]. SNP sites not covered by a read are coded as zero in that read's row. The bases at the covered sites are then converted to −1 (alternative allele) or 1 (reference allele), assuming bi-allelic SNPs.
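The encoding described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `build_fragment_matrix`, the representation of a read as a dictionary from SNP index to observed base, and the toy alleles are all assumptions made for the example.

```python
import numpy as np

def build_fragment_matrix(reads, ref_alleles, alt_alleles):
    """Build an N x l fragment matrix with entries in {-1, 0, 1}.

    reads        : list of dicts mapping a 0-based SNP index to the observed base
                   (hypothetical input format, chosen for illustration)
    ref_alleles  : sequence of reference alleles, one per SNP site
    alt_alleles  : sequence of alternative alleles, one per SNP site
    """
    n_reads, n_snps = len(reads), len(ref_alleles)
    R = np.zeros((n_reads, n_snps), dtype=int)  # 0 = site not covered by the read
    for i, read in enumerate(reads):
        for j, base in read.items():
            if base == ref_alleles[j]:
                R[i, j] = 1       # reference allele
            elif base == alt_alleles[j]:
                R[i, j] = -1      # alternative allele
            # any other base is ignored and the entry stays 0
    return R

# Toy data (made up): three SNP sites, three reads covering two sites each.
ref = "ACG"
alt = "GTA"
reads = [{0: "A", 1: "T"}, {1: "C", 2: "A"}, {0: "G", 2: "G"}]
R = build_fragment_matrix(reads, ref, alt)
print(R)
```

Each row of `R` holds one read, and a zero marks a SNP site the read does not cover.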

As an example of an error-free case, consider the first exon of HLA-A, a gene on chromosome 6 with NCBI reference sequence number NG_029217.2. Its first 40 bases are presented in Fig 1a. It contains five bi-allelic SNP sites (refSNP): C/T (rs753601428), C/G (rs529070997), G/T (rs41560714), A/C (rs551138783) and A/G (rs778615037). The procedure of constructing the fragment matrix is depicted in Fig 1d. In this example, the exact haplotypes that should be reconstructed by the haplotype assembly algorithms are {CGTAG} and {TCGCA}.

Fig 1. The HLA-A gene is located on chromosome 6 with NCBI reference sequence number NG_029217.2. It contains five bi-allelic SNP sites (refSNP): C/T (rs753601428), C/G (rs529070997), G/T (rs41560714), A/C (rs551138783) and A/G (rs778615037). (a) An example of homologous chromosomes in which the SNP sites are indicated in bold; (b) an example of aligned reads; (c) the fragments after removing non-informative reads and non-SNP bases; (d) the constructed fragment matrix.
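As a small worked check of the HLA-A example, the two haplotypes can be encoded with the ±1 convention. The sketch below assumes that the first allele of each listed pair (C/T, C/G, G/T, A/C, A/G) is the reference allele; the text does not state this explicitly, so treat the resulting signs as illustrative. The helper name `encode` is likewise hypothetical.

```python
import numpy as np

# (reference, alternative) alleles of the five SNP sites.
# ASSUMPTION: the first allele of each pair is taken as the reference.
snps = [("C", "T"), ("C", "G"), ("G", "T"), ("A", "C"), ("A", "G")]

def encode(haplotype):
    """Map a haplotype string to +1 (reference) / -1 (alternative) per SNP site."""
    coded = []
    for base, (ref, alt) in zip(haplotype, snps):
        if base == ref:
            coded.append(1)
        elif base == alt:
            coded.append(-1)
        else:
            raise ValueError(f"base {base!r} is neither {ref!r} nor {alt!r}")
    return np.array(coded)

h1 = encode("CGTAG")  # encodes to (1, -1, -1, 1, -1) under the assumption above
h2 = encode("TCGCA")  # encodes to (-1, 1, 1, -1, 1)

# At heterozygous sites the two haplotypes carry opposite alleles.
print(np.array_equal(h1, -h2))
```

Whichever allele is taken as the reference, the two encoded haplotypes are negatives of each other, because every site here is heterozygous.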

The fragment matrix R can be modeled using a matrix completion approach [19, 20]. In the error-free case, R is a partially observed matrix modeled as

\[ R = P_\Omega(M), \]

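where M is the completed version of matrix R (see section B of S1 Appendix for more details). P_Ω is the observation operator defined as

\[ [P_\Omega(M)]_{ij} = \begin{cases} M_{ij}, & (i,j) \in \Omega, \\ 0, & \text{otherwise}, \end{cases} \]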

in which Ω is the set of indices of known entries. To generalize the model to the more realistic case that allows erroneous entries, we use an additive measurement error model inspired by [11, 19, 20]:

\[ R = P_\Omega(M) + E. \]

To define the error matrix E, we first clarify what we mean by an error. A substitution error is the conversion of a DNA base to one of the other three possible bases during sequencing. As mentioned earlier, during fragment matrix construction only two bases (the reference and alternative alleles) are permitted at each SNP site and the other possible bases are ignored; as a result, a substitution to an ignored base does not affect the entries of the fragment matrix. Accordingly, we introduce the term bi-allelic substitution, or simply bi-substitution, to distinguish it from the generally defined substitution. A bi-substitution error occurs when a reference allele is converted to the alternative allele or vice versa. Consequently, an error in an entry of P_Ω(M) reduces to a change from −1 to 1 or vice versa. This can be formulated as the addition of 2 (or −2) to each erroneous entry of P_Ω(M), which is recorded in the error matrix E. We assume that each non-zero entry of R is erroneous with probability p_e, the bi-substitution error probability, independently of the other entries. This value equals one third of the substitution error probability of the sequencing device, p_s, that is, p_e = p_s/3, which follows if the three possible substituted bases are equally likely, since only one of them is the other permitted allele.
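The observation operator and the bi-substitution error model can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the authors' code: the function names `observe` and `add_bisubstitution_errors`, the toy coverage mask, the haplotype vector and the value p_s = 0.15 are all chosen for illustration. It uses the fact that flipping an observed entry m between −1 and 1 is the same as adding −2m to it.

```python
import numpy as np

def observe(M, mask):
    """P_Omega: keep the entries of M where mask is True, set the rest to 0."""
    return np.where(mask, M, 0)

def add_bisubstitution_errors(M, mask, p_s, rng):
    """Return (R, E) with R = P_Omega(M) + E.

    Each observed entry is flipped independently with probability p_e = p_s / 3.
    """
    p_e = p_s / 3.0
    flips = mask & (rng.random(M.shape) < p_e)   # erroneous observed entries
    E = np.where(flips, -2 * M, 0)               # -1 -> 1 adds 2, 1 -> -1 adds -2
    R = observe(M, mask) + E
    return R, E

rng = np.random.default_rng(0)

# Toy completed matrix M: each row is one of two complementary haplotypes.
h = np.array([1, -1, -1, 1, -1])
M = np.vstack([h, -h, h, -h])

# Toy coverage pattern Omega (True = SNP site covered by the read).
mask = np.array([[1, 1, 1, 0, 0],
                 [0, 1, 1, 1, 0],
                 [0, 0, 1, 1, 1],
                 [1, 1, 0, 0, 1]], dtype=bool)

R, E = add_bisubstitution_errors(M, mask, p_s=0.15, rng=rng)
print(R)
print(E)
```

By construction, `E` is non-zero only on covered entries and takes values in {−2, 0, 2}, so `R` stays in {−1, 0, 1} with zeros exactly at the uncovered sites.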

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
