2.8. Algorithm for the Classification of Promoter Sequences from the A. thaliana Genome
This protocol is extracted from research article:
Multiple Alignment of Promoter Sequences from the Arabidopsis thaliana L. Genome
Genes (Basel), Jan 21, 2021; DOI: 10.3390/genes12020135

The MAHDS algorithm developed in this study was applied to align promoter sequences from the A. thaliana genome (downloaded from https://epd.epfl.ch//index.php [33]). Each promoter had length K (600 nt), which included the region from −499 to +100 bp relative to the first base of the start codon (position +1). There were 22,694 promoter sequences in the analyzed set, denoted as PM (supplementary material 1). Since the algorithm shown in Figure 1 would require considerable resources to align all the promoter sequences, we created a sample containing 500 randomly chosen promoters, which were combined into one sequence S with L = 500 × 600 = 300,000 nt. Then, we constructed the MSA as described in Figure 1 and Sections 2.1–2.6 and obtained mV(n1), the two-dimensional alignment of sequences S1 and S, and the PWM mW(600, 16).
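The sampling step above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `build_sample_sequence`, the fixed seed, and the in-memory list of promoter strings are all assumptions made for the example.

```python
import random

PROMOTER_LEN = 600   # K in the text
SAMPLE_SIZE = 500    # number of promoters drawn per sample

def build_sample_sequence(promoters, sample_size=SAMPLE_SIZE, seed=0):
    """Draw promoters at random without replacement and concatenate them.

    Returns the chosen indices and the concatenated sequence S.
    If fewer than sample_size promoters are available, all are used.
    """
    rng = random.Random(seed)          # hypothetical fixed seed for repeatability
    n = min(sample_size, len(promoters))
    chosen = rng.sample(range(len(promoters)), n)
    s = "".join(promoters[i] for i in chosen)
    # Every promoter has length K, so len(S) must equal n * K.
    assert len(s) == n * PROMOTER_LEN
    return chosen, s
```

The internal length check makes the arithmetic L = 500 × K explicit for a full sample.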

However, the volume of the PM set was significantly larger than the 500 randomly selected promoters included in sequence S. Furthermore, promoter sequences from the PM set might not show statistically significant alignment with maxW(600, 16). Therefore, we aligned each promoter from the PM set with matrix maxW(600, 16) using Formula (5) and considering the promoter sequence as S with L = 600. As a result, F(L, L) for each promoter from the PM set was calculated and put into the Ves(i) vector (where i is the promoter number).
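As a rough illustration of scoring one sequence against a position weight matrix, the sketch below uses a generic global dynamic-programming alignment with a linear gap penalty and returns the value in the final cell, which plays the role of F(L, L). It is not Formula (5): the recurrence, the gap penalty, the function name `align_to_pwm`, and the 4-letter matrix layout are assumptions (the matrix in the text has 16 rows, which this simplified sketch does not reproduce).

```python
def align_to_pwm(seq, pwm, gap=1.0):
    """Globally align seq against a PWM and return the final DP score.

    pwm is a list of columns; each column is a dict mapping a base to its
    weight. gap is a hypothetical linear penalty for skipping a base or a
    column.
    """
    n, m = len(seq), len(pwm)
    # Row 0: aligning an empty sequence prefix against j matrix columns.
    prev = [-gap * j for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [-gap * i] + [0.0] * m
        base = seq[i - 1]
        for j in range(1, m + 1):
            cur[j] = max(
                prev[j - 1] + pwm[j - 1][base],  # base i placed in column j
                prev[j] - gap,                   # base i left unaligned
                cur[j - 1] - gap,                # column j left unaligned
            )
        prev = cur
    return prev[m]
```

Only two rows of the table are kept in memory because the sketch needs the final score, not the alignment path.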

Then, the promoter sequences with statistically significant Ves(i) were selected from the PM set. To do this, we used PMR(i) sets obtained by random shuffling of the promoter sequence with number i; each PMR(i) set contained 10^3 random sequences of 600 bp. We aligned each sequence from PMR(i) against the maxW(600, 16) matrix, calculated F(L, L), denoted as Vesr(j) (j = 1, 2, …, 10^3), then determined the mean and variance D(Vesr) of Vesr(j) and calculated Z for each Ves(i) using Formula (7). If Z > Z0, the promoter was considered to have a statistically significant alignment with the maxW(600, 16) matrix. For Z0 = 5.0, the probability of random similarity between the promoter and maxW(600, 16) was about 10^−6. All promoter sequences with Z > 5.0 were assigned to the same class, characterized by the maxW(600, 16) matrix.
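The significance test can be sketched as a shuffle-based Z-score. Formula (7) is not reproduced in this section, so the sketch assumes the usual form Z = (Ves − mean(Vesr)) / sqrt(D(Vesr)). The name `z_score` and the `score_fn` argument (a stand-in for the alignment score F(L, L) against the class matrix) are hypothetical.

```python
import random
import statistics

def z_score(promoter, score_fn, n_shuffles=1000, seed=0):
    """Z-score of a promoter's score against scores of its own shuffles.

    score_fn is a hypothetical callable mapping a sequence to a score.
    Shuffling preserves the base composition of the promoter.
    """
    rng = random.Random(seed)
    observed = score_fn(promoter)
    letters = list(promoter)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null_scores.append(score_fn("".join(letters)))
    mean = statistics.fmean(null_scores)
    sd = statistics.pstdev(null_scores)  # assumption: population variance D(Vesr)
    if sd == 0:
        return 0.0  # degenerate null distribution; treated as not significant
    return (observed - mean) / sd
```

A promoter would then be assigned to the class when `z_score(...) > 5.0`, matching the threshold Z0 used in the text.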

After creating the first class of A. thaliana promoter sequences in this way, we removed all sequences with Z > 5.0 from the PM set to obtain the PM(1) set, which was used to create further classes. The described procedure was repeated for the PM(1) set, starting from the creation of a new sample of 500 randomly selected promoters. As a result, we created a second class of promoters and a PM(2) set. We repeated this procedure for the sets PM(i), i = 1, 2, …. Each iteration created a new class and the corresponding maxW(600, 16) matrix. If at some iteration the PM(i) set contained fewer than 500 sequences, we used all of its sequences for the multiple alignment. The multiple alignments generated for each class are shown in Supplementary material 1. The procedure was stopped at the iteration i = i0 when the size of classes with i > i0 was less than 100 sequences. We set the minimum class size to 100 based on an analysis of random sequences: when we performed the procedure on randomly shuffled promoter sequences (22,694 in total), the class sizes ranged from 6 to 27 sequences, with an average of 16. Thus, using this threshold kept the type I error rate below 16%.
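The iterative procedure can be outlined as a loop. This is a sketch under stated assumptions: `build_pwm` (sample to class matrix) and `z_of` (promoter and matrix to Z) are hypothetical callables standing in for the alignment and significance steps, and stopping at the first class smaller than the minimum size is one reading of the stopping rule.

```python
import random

def classify(promoters, build_pwm, z_of, z0=5.0,
             sample_size=500, min_class=100, seed=0):
    """Iteratively split promoters into classes.

    Returns the list of (matrix, members) classes and the unclassified rest.
    """
    rng = random.Random(seed)
    remaining = list(promoters)
    classes = []
    while remaining:
        # Use every sequence once fewer than sample_size remain.
        if len(remaining) < sample_size:
            sample = remaining
        else:
            sample = rng.sample(remaining, sample_size)
        pwm = build_pwm(sample)
        members, rest = [], []
        for p in remaining:
            (members if z_of(p, pwm) > z0 else rest).append(p)
        if len(members) < min_class:
            break  # assumed stopping rule: class too small to be kept
        classes.append((pwm, members))
        remaining = rest  # this is the next PM(i) set
    return classes, remaining
```

Passing the two steps in as callables keeps the loop independent of how the matrix and Z are actually computed.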
