2.7. Comparison of Various MSA Methods

Eugene V. Korotkov, Yulia M. Suvorova, Dmitrii O. Kostenko, Maria A. Korotkova

The algorithm shown in Figure 1 can also be applied to determine the statistical significance of MSAs created by other algorithms. Let us denote the MSA as A, the length of each sequence in A as K, and the number of sequences as N. All sequences from A are linked to produce sequence S3 of length L = KN. Then, the PWM is calculated for A using Formula (2), transformed using Formulas (3) and (4), and applied to create the two-dimensional alignment for sequence S3 using Formulas (5) and (6) and to calculate F(L, L). The statistical significance of A is then computed according to Formula (7).

However, columns in which the sum of elements is < N/2 should be excluded from A to eliminate redundant deletions in the calculation of F(L, L), whereas columns with a sum > N/2 cannot be excluded because this would lead to an excessive number of insertions. Consequently, the number of columns became K′ ≤ K, resulting in a new alignment A′ (K′ is the length of each sequence in A′). To construct the PWM from A′, the frequency matrix M(K′, 16) was first calculated using Formula (1), and then the PWM (designated as WA′) was calculated using Formula (2). Formulas (3) and (4) were applied to transform the resulting matrix and obtain matrix WTA, which was used to calculate F(L, L) (L = K′N) based on A′.

For this, the sequences from A′ were merged into sequence S4 with all gaps preserved. At the same time, sequence S5, containing the column numbers {1, 2, …, K′} of the WTA matrix repeated N times, was created. Then, for all i from 2 to L = K′N for which s4(i − 1) and s4(i) were not gaps, we accumulated the sum F1 = F1 + WTA(s5(i), n), where n = s4(i − 1) + (s4(i) − 1) × 4; for those i for which s4(i − 1) was a gap, the sum was accumulated as F2 = F2 + E(s5(i), s4(i)). Matrix E was calculated from the WTA matrix using Formula (6). We also calculated F3 = −k1 × del, where k1 is the number of gaps in alignment A′ and del is the insertion/deletion penalty (Formula (5)), as well as F4 = −k2 × del, where k2 is the difference in the number of nucleotides between alignments A and A′. Finally, we calculated F(K′N, K′N) = F5 = F1 + F2 + F3 + F4.
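The column filtering and the accumulation of F1 to F5 can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions that the text does not fix: nucleotides are coded a = 1, c = 2, g = 3, t = 4 with 0 for a gap; the "sum of elements" of a column is taken to be its number of nucleotides; a column with exactly N/2 nucleotides is kept; positions where s4(i) itself is a gap add nothing to F1 or F2. The matrices `wta` (K′ × 16) and `e` (K′ × 4) are supplied by the caller, since Formulas (1) to (6) are not reproduced here, and the function names are hypothetical.

```python
GAP = "-"
CODE = {"a": 1, "c": 2, "g": 3, "t": 4}  # assumed coding; 0 marks a gap


def filter_columns(alignment):
    """Return A': drop columns where fewer than N/2 rows hold a nucleotide."""
    n_rows = len(alignment)
    keep = [
        j for j in range(len(alignment[0]))
        # Assumption: a column with exactly N/2 nucleotides is kept.
        if 2 * sum(row[j] != GAP for row in alignment) >= n_rows
    ]
    return ["".join(row[j] for j in keep) for row in alignment]


def score_alignment(original, filtered, wta, e, gap_penalty):
    """Return (F1, F2, F3, F4, F5) for A (original) and A' (filtered)."""
    k_prime = len(filtered[0])
    # S4: rows of A' joined end to end, gaps preserved (coded as 0).
    s4 = [CODE.get(ch.lower(), 0) for row in filtered for ch in row]
    # S5: column numbers 1..K' repeated once per row.
    s5 = [j + 1 for _ in filtered for j in range(k_prime)]

    f1 = f2 = 0.0
    for i in range(1, len(s4)):  # 0-based index; i = 2..L in the text
        prev, cur, col = s4[i - 1], s4[i], s5[i]
        if cur == 0:
            continue  # assumption: gaps are charged only through F3
        if prev != 0:
            n = prev + (cur - 1) * 4  # dinucleotide index, 1..16
            f1 += wta[col - 1][n - 1]
        else:
            f2 += e[col - 1][cur - 1]

    k1 = s4.count(0)  # gaps remaining in A'
    nucleotides_a = sum(ch != GAP for row in original for ch in row)
    k2 = nucleotides_a - (len(s4) - k1)  # nucleotides lost with removed columns
    f3 = -k1 * gap_penalty
    f4 = -k2 * gap_penalty
    return f1, f2, f3, f4, f1 + f2 + f3 + f4


# Toy input with placeholder matrices (constant entries, not real PWM values).
a = ["ac-t", "a--t", "acgt", "ag-t"]
a_prime = filter_columns(a)  # the third column has 1 < N/2 nucleotides
wta = [[1.0] * 16 for _ in range(len(a_prime[0]))]
e = [[0.5] * 4 for _ in range(len(a_prime[0]))]
print(a_prime)
print(score_alignment(a, a_prime, wta, e, gap_penalty=2.0))
```

In this toy case one column is removed (K′ = 3), one gap remains in A′ (k1 = 1), and one nucleotide is lost with the removed column (k2 = 1). Note that, as in the text, the index i runs across the boundaries between consecutive sequences in S4.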

Weight matrix WTA is the image of alignment A′, so the statistical significance of A′ can be estimated from how well the WTA matrix aligns with random sequences. If the alignment is random, then matrix WTA is random too, and F5 is close to the value obtained for random sequences (Section 2.2).

Then, sequence S4 was randomly shuffled to create 200 sequences, and matrix WTA was included in the Q set as described in Section 2.5. Each of the 200 sequences was treated as described in Sections 2.2, 2.3, and 2.4. As a result, 200 values of maxV(n1), one for each random sequence, were obtained and used to calculate the mean of maxV(n1) and the variance D(maxV(n1)). Then, we calculated Z using Formula (7), with F5 used in place of maxV(n1). Because Z was calculated by the same algorithm for MSAs constructed by different mathematical methods, including MAHDS, the methods could be compared based on their Z values (Supplementary Material 1).
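The shuffling and Z calculation can be sketched as follows. Formula (7) is not reproduced in this section, so the sketch assumes it has the usual Z-score form, Z = (F5 − mean(maxV(n1))) / sqrt(D(maxV(n1))), with D taken as the population variance; both points are assumptions. The alignment of each shuffled sequence (Sections 2.2 to 2.4) is not implemented and is represented by a caller-supplied scoring function; `shuffled_copies`, `z_score`, and `score_fn` are hypothetical names.

```python
import random
import statistics


def shuffled_copies(s4, count=200, seed=None):
    """Return `count` random permutations of sequence S4."""
    rng = random.Random(seed)
    return [rng.sample(s4, len(s4)) for _ in range(count)]


def z_score(f5, random_scores):
    """Assumed form of Formula (7): (F5 - mean) / sqrt(variance)."""
    mean = statistics.mean(random_scores)
    variance = statistics.pvariance(random_scores)  # population variance (assumption)
    return (f5 - mean) / variance ** 0.5


def significance(f5, s4, score_fn, count=200, seed=None):
    """Score each shuffled sequence with score_fn (stand-in for maxV(n1))."""
    scores = [score_fn(seq) for seq in shuffled_copies(s4, count, seed)]
    return z_score(f5, scores)
```

For example, `z_score(10.0, [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])` gives 2.5, since the random scores have mean 5 and population variance 4.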
