2.8. Algorithm for the Classification of Promoter Sequences from the A. thaliana Genome

This protocol is extracted from research article:

Multiple Alignment of Promoter Sequences from the Arabidopsis thaliana L. Genome

**Genes (Basel)**, Jan 21, 2021; DOI: 10.3390/genes12020135

Procedure

The MAHDS algorithm developed in this study was applied to align promoter sequences from the *A. thaliana* genome (downloaded from https://epd.epfl.ch//index.php [33]). Each promoter had length *K* = 600 nt and covered the region from −499 to +100 bp relative to the first base of the start codon (position +1). The analyzed set, denoted as *PM*, contained 22,694 promoter sequences (supplementary material 1). Since the algorithm shown in Figure 1 requires considerable resources to align all the promoter sequences, we created a sample of 500 randomly chosen promoters, which were combined into one sequence *S* with *L* = 500 × 600 = 300,000 nt. Then, we constructed the MSA as described in Figure 1 and Sections 2.1 to 2.6 and obtained *mV*(*n*_{1}), the two-dimensional alignment of sequences *S*_{1} and *S*, and PWM *mW*(600, 16).
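The sampling step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_sample_sequence`, the seed handling, and the toy promoter set are hypothetical and are not part of the published pipeline. The sketch draws promoters without replacement, concatenates them into *S*, and computes *L* as the number of sampled promoters times *K*.

```python
import random

K = 600            # promoter length in nt, as stated in the text
SAMPLE_SIZE = 500  # number of promoters drawn for one alignment

def build_sample_sequence(promoters, sample_size=SAMPLE_SIZE, seed=None):
    """Draw up to sample_size promoters without replacement and concatenate them.

    Hypothetical helper: returns the chosen indices and the combined sequence S.
    """
    rng = random.Random(seed)
    n = min(sample_size, len(promoters))      # use all promoters if fewer remain
    chosen = rng.sample(range(len(promoters)), n)
    return chosen, "".join(promoters[i] for i in chosen)

# Toy stand-in for the PM set (random sequences, NOT real promoters).
toy_rng = random.Random(0)
toy_pm = ["".join(toy_rng.choice("acgt") for _ in range(K)) for _ in range(1000)]

indices, S = build_sample_sequence(toy_pm, seed=1)
L = len(S)  # number of sampled promoters times K
```

Keeping the chosen indices alongside *S* makes it possible to trace each 600-nt block of the combined sequence back to its promoter.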

However, the *PM* set was much larger than the 500 randomly selected promoters included in sequence *S*, and not every promoter sequence from the *PM* set would necessarily show a statistically significant alignment with *maxW*(600, 16). Therefore, we aligned each promoter from the *PM* set with the matrix *maxW*(600, 16) using Formula (5), treating the promoter sequence as *S* with *L* = 600. As a result, *F*(*L*, *L*) was calculated for each promoter from the *PM* set and stored in the vector *Ves*(*i*), where *i* is the promoter number.

Then, the promoter sequences with statistically significant *Ves*(*i*) were selected from the *PM* set. To do this, we used *PMR*(*i*) sets obtained by random shuffling of the promoter sequence with number *i*; each *PMR*(*i*) set contained 10^{3} random sequences of 600 bp. We aligned each sequence from *PMR*(*i*) with the *maxW*(600, 16) matrix, calculated *F*(*L*, *L*), denoted as *Vesr*(*j*) (*j* = 1, 2, …, 10^{3}), determined the mean and the variance *D*(*Vesr*) of *Vesr*(*j*), and calculated *Z* for each *Ves*(*i*) using Formula (7). If *Z* > *Z*_{0}, the promoter was considered to have a statistically significant alignment with the *maxW*(600, 16) matrix. For *Z*_{0} = 5.0, the probability of random similarity between the promoter and *maxW*(600, 16) was about 10^{−6}. All promoter sequences with *Z* > 5.0 were assigned to the same class characterized by the *maxW*(600, 16) matrix.
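The shuffling test can be illustrated with a minimal Python sketch. Formula (5) and Formula (7) are not reproduced in this section, so the sketch makes two assumptions: the alignment score is replaced by a hypothetical stand-in `toy_score` (matches against an invented consensus, not the real *F*(*L*, *L*)), and *Z* is taken to be the usual standardized score, (*Ves* − mean of *Vesr*) divided by the square root of *D*(*Vesr*), with the population variance used for *D*.

```python
import random
import statistics

def z_score(promoter, score_fn, n_shuffles=1000, seed=None):
    """Z of a promoter's score against scores of its own shuffled copies.

    score_fn is a placeholder for the alignment score F(L, L); the exact
    form of Formula (7) is assumed here, not taken from the article.
    """
    rng = random.Random(seed)
    ves = score_fn(promoter)
    letters = list(promoter)
    vesr = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)                 # preserves base composition
        vesr.append(score_fn("".join(letters)))
    mean = statistics.fmean(vesr)
    sd = statistics.pstdev(vesr)             # sqrt of population variance D(Vesr)
    if sd == 0:
        return 0.0                           # degenerate case: all scores equal
    return (ves - mean) / sd

# Hypothetical scoring function: matches to an invented 600-nt consensus.
consensus = "ta" * 300

def toy_score(seq):
    return sum(a == b for a, b in zip(seq, consensus))

Z0 = 5.0
z_match = z_score(consensus, toy_score, seed=1)   # sequence identical to consensus

shuffled = list(consensus)
random.Random(2).shuffle(shuffled)
z_random = z_score("".join(shuffled), toy_score, seed=3)  # composition-matched control

is_significant = z_match > Z0
```

Shuffling each promoter separately keeps its base composition fixed, so the resulting *Z* reflects the order of bases rather than their frequencies.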

When we created the first class of the *A. thaliana* promoter sequences in this way, we removed all the sequences with *Z* > 5.0 from the *PM* set to obtain the *PM*(1) set, which was used to create further classes. The described procedure was repeated for the *PM*(1) set, starting from the creation of a new sample of 500 randomly selected promoters. As a result, we created a second class of promoters and a *PM*(2) set. We repeated this procedure for the sets *PM*(*i*), *i* = 1, 2, …. Each iteration created a new class and the corresponding *maxW*(600, 16) matrix. If at some iteration the *PM*(*i*) set contained fewer than 500 sequences, all of its sequences were used for the multiple alignment. The multiple alignments generated for each class are shown in Supplementary material 1. The procedure was stopped at the iteration *i* = *i*_{0} when the classes with *i* > *i*_{0} contained fewer than 100 sequences. We set the minimum class size to 100 based on the analysis of random sequences: when we performed the procedure on randomly shuffled promoter sequences (22,694 in total), the class size ranged from 6 to 27 sequences, with an average of 16 sequences. This means that with this threshold the type I error rate was kept below 16%.
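The iterative procedure can be summarized as a loop. The sketch below is a simplified illustration, not the authors' implementation: `build_matrix` and `z_score` are hypothetical callables standing in for the MAHDS alignment and the shuffling test, and the toy data replace real promoters with short strings so that the control flow can be followed.

```python
import random
from collections import Counter

def classify(pm, build_matrix, z_score, z0=5.0,
             sample_size=500, min_class_size=100, seed=None):
    """Iteratively split pm into classes; returns (classes, unclassified)."""
    rng = random.Random(seed)
    remaining = list(pm)
    classes = []
    while remaining:
        # Use all sequences when fewer than sample_size remain.
        n = min(sample_size, len(remaining))
        sample = rng.sample(remaining, n)
        matrix = build_matrix(sample)
        flags = [z_score(p, matrix) > z0 for p in remaining]
        members = [p for p, f in zip(remaining, flags) if f]
        if len(members) < min_class_size:
            break                           # stop: the new class is too small
        classes.append((matrix, members))
        remaining = [p for p, f in zip(remaining, flags) if not f]
    return classes, remaining

# Toy stand-ins (assumptions for illustration only): the "matrix" is the most
# common first letter in the sample, and Z is 10 for sequences that start
# with that letter and 0 otherwise.
def toy_build_matrix(sample):
    return Counter(s[0] for s in sample).most_common(1)[0][0]

def toy_z_score(promoter, matrix):
    return 10.0 if promoter[0] == matrix else 0.0

toy_pm = ["attt"] * 300 + ["cttt"] * 200 + ["gttt"] * 50
classes, unclassified = classify(toy_pm, toy_build_matrix, toy_z_score, seed=0)
class_sizes = [len(members) for _, members in classes]
```

In this toy run the first two classes pass the size threshold, while the last group of 50 sequences falls below 100 and ends the loop, mirroring the stopping rule described above.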

Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
