MEME algorithm

SP Shaoliang Peng
MC Minxia Cheng
KH Kaiwen Huang
YC YingBo Cui
ZZ Zhiqiang Zhang
RG Runxin Guo
XZ Xiaoyu Zhang
SY Shunyun Yang
XL Xiangke Liao
YL Yutong Lu
QZ Quan Zou
BS Benyun Shi
request Request a Protocol
ask Ask a question
Favorite

The MEME algorithm [2] is a popular and well established motif discovery algorithm, which extends the expectation maximization (EM) [10] algorithm. Given a set of biopolymer sequences where little or nothing is known in advance about any motifs that may be present, MEME attempts to discover new motifs using a statistical motif model θ [11]. A motif model θ is a matrix of letter frequencies representing frequency estimates of letters occurring in different positions of a shared motif. Given a motif of width W from an alphabet Σ = {X1,X2, ...,XN}which has N letters (eg. the alphabet of DNA is {A,T,C,G}), the size of the matrix θ is N×(W+ 1) and the matrix value θi, j (1iN and 0jW) is defined as follows:

With a set of input sequences, the EM algorithm is carried out from an initial model θ(0) which represents a starting point. Then it runs until convergence in order to find the final motif model θ(q) with maximal posterior probability. Besides, it can just run for a fixed number of iterations before convergence. MEME provides support for three different types of search modes: one occurrence per sequence (OOPS), zero or one occurrence per sequence (ZOOPS), and two component mixture (TCM) [2]. The type of model chosen by a user depends upon prior knowledge concerning the dataset. The OOPS model assumes that there is only one occurrence per sequence of the motif in the dataset, the ZOOPS model is a generalization of OOPS and assumes zero or one occurrence per sequence of the motif, and the TCM model assumes zero or more non-overlapping occurrences of the motif per sequence. Since the OOPS and ZOOPS models are sufficient for most motif finding applications, this paper concentrates on the support for the OOPS and the ZOOPS search models.

During each motif search, MEME does a multi-start search of a given motif width W and the search consists of two stages: starting point searching and EM [11].

In the starting point searching stage, MEME iterates over all possible initial models and chooses the initial models with the highest statistical significance. More specifically, MEME converts each W length substring occurring in a sequence dataset into a motif model and calculates the weighted log likelihood ratio on different numbers of predicted sites. The potential motif models with the highest weighted log likelihood ratio are selected as starting points for the successive EM stages. Also in this stage, a P-value is calculated, which represents the probability of a random string, generated from the background letter frequencies.

In the EM stage, An EM algorithm is performed for a fixed number of iterations or until convergence from each of the highest-scoring initial models and then the best motif model is chosen.

During the starting point searching, Given the input dataset S = {S1, S2, ..., Sn} of n biological sequences from Σ, and motif width W, The following notations are used for the convenience of discussion: Li denotes the length of sequence Si, s¯i denotes the reverse complement of Si, Si, j denotes the substring starting at position j in sequence Si, Si(j) denotes the jth letter in Si, for 1in and 0jLi − W. The starting point searching process primarily consists of four steps for the OOPS and ZOOPS models:

• Calculate the probability score P(Si, j, Sk, l) from the forward strand (or P(Si, j, s¯k,l) from the reverse complement), which represents the probability that a site starts at position l in Sk when a site starts at position j in Si. The time complexity is O(li·lk) for each sequence pair Si and Sk.

• select the highest-scoring substring Sk, maxk (as well as its strand orientation) for each sequence Sk. The time complexity is O(lk) for each sequence Sk.

• Sort the nsites0 highest-scoring substrings {Sk, maxk} in decreasing order of scores and determine the potential starting points. The time complexity is O(nlogn) for OOPS and O(n2W) for ZOOPS.

• Update the hash map and starting point heap.

The probability score P(Si, j, Sk, l) is computed as:

where mat denotes the letter frequency matrix of size |Σ| × |Σ|. To reduce computation redundancy, Eq. (2) can be further simplified to Eq. (3), where the computation of the probability scores {P(Si, j, Sk, l)} in the jth iteration depends on the resulting scores {P(Si, j − 1, Sk, l − 1)} in the (j-1)th iteration. However, {P(Si, j, Sk, 0)} needs to be computed individually using Eq. (2).

The detailed algorithm for the starting point searching is illustrated in Algorithm 1. And Fig. 1 shows the process of starting point searching algorithm for the OOPS and ZOOPS models. MEME Suite [12] is an open source implementation of MEME algorithm. The method presented in this paper is based on MEME Suite (version 4.11.2).

Process of starting point searching algorithm

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A