MEME algorithm

Shaoliang Peng; Minxia Cheng; Kaiwen Huang; YingBo Cui; Zhiqiang Zhang; Runxin Guo; Xiaoyu Zhang; Shunyun Yang; Xiangke Liao; Yutong Lu; Quan Zou; Benyun Shi

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

MEME algorithm

SP Shaoliang Peng

MC Minxia Cheng

KH Kaiwen Huang

YC YingBo Cui

ZZ Zhiqiang Zhang

RG Runxin Guo

XZ Xiaoyu Zhang

SY Shunyun Yang

XL Xiangke Liao

YL Yutong Lu

QZ Quan Zou

BS Benyun Shi

This method is extracted from research article: BMC Bioinformatics, Aug 2018

Efficient computation of motif discovery on Intel Many Integrated Core (MIC) Architecture

DOI: 10.1186/s12859-018-2276-1

Request a Protocol

Ask a question

Favorite

The MEME algorithm [2] is a popular and well established motif discovery algorithm, which extends the expectation maximization (EM) [10] algorithm. Given a set of biopolymer sequences where little or nothing is known in advance about any motifs that may be present, MEME attempts to discover new motifs using a statistical motif model θ [11]. A motif model θ is a matrix of letter frequencies representing frequency estimates of letters occurring in different positions of a shared motif. Given a motif of width W from an alphabet Σ = {X₁,X₂, ...,X_N}which has N letters (eg. the alphabet of DNA is {A,T,C,G}), the size of the matrix θ is N×(W+ 1) and the matrix value θ_{i, j} (1≤i≤N and 0≤j≤W) is defined as follows:

With a set of input sequences, the EM algorithm is carried out from an initial model θ⁽⁰⁾ which represents a starting point. Then it runs until convergence in order to find the final motif model θ^(q) with maximal posterior probability. Besides, it can just run for a fixed number of iterations before convergence. MEME provides support for three different types of search modes: one occurrence per sequence (OOPS), zero or one occurrence per sequence (ZOOPS), and two component mixture (TCM) [2]. The type of model chosen by a user depends upon prior knowledge concerning the dataset. The OOPS model assumes that there is only one occurrence per sequence of the motif in the dataset, the ZOOPS model is a generalization of OOPS and assumes zero or one occurrence per sequence of the motif, and the TCM model assumes zero or more non-overlapping occurrences of the motif per sequence. Since the OOPS and ZOOPS models are sufficient for most motif finding applications, this paper concentrates on the support for the OOPS and the ZOOPS search models.

During each motif search, MEME does a multi-start search of a given motif width W and the search consists of two stages: starting point searching and EM [11].

In the starting point searching stage, MEME iterates over all possible initial models and chooses the initial models with the highest statistical significance. More specifically, MEME converts each W length substring occurring in a sequence dataset into a motif model and calculates the weighted log likelihood ratio on different numbers of predicted sites. The potential motif models with the highest weighted log likelihood ratio are selected as starting points for the successive EM stages. Also in this stage, a P-value is calculated, which represents the probability of a random string, generated from the background letter frequencies.

In the EM stage, An EM algorithm is performed for a fixed number of iterations or until convergence from each of the highest-scoring initial models and then the best motif model is chosen.

During the starting point searching, Given the input dataset S = {S₁, S₂, ..., S_n} of n biological sequences from Σ, and motif width W, The following notations are used for the convenience of discussion: L_i denotes the length of sequence S_i, ${\bar{s}}_{i}$ denotes the reverse complement of S_i, S_{i, j} denotes the substring starting at position j in sequence S_i, S_i(j) denotes the jth letter in S_i, for 1≤i≤n and 0≤j≤L_i − W. The starting point searching process primarily consists of four steps for the OOPS and ZOOPS models:

• Calculate the probability score P(S_{i, j}, S_{k, l}) from the forward strand (or P(S_{i, j}, ${\bar{s}}_{k, l}$ ) from the reverse complement), which represents the probability that a site starts at position l in S_k when a site starts at position j in S_i. The time complexity is O(l_i·l_k) for each sequence pair S_i and S_k.

• select the highest-scoring substring S_{k, maxk} (as well as its strand orientation) for each sequence S_k. The time complexity is O(l_k) for each sequence S_k.

• Sort the nsites0 highest-scoring substrings {S_{k, maxk}} in decreasing order of scores and determine the potential starting points. The time complexity is O(nlogn) for OOPS and O(n²W) for ZOOPS.

• Update the hash map and starting point heap.

The probability score P(S_{i, j}, S_{k, l}) is computed as:

where mat denotes the letter frequency matrix of size |Σ| × |Σ|. To reduce computation redundancy, Eq. (2) can be further simplified to Eq. (3), where the computation of the probability scores {P(S_{i, j}, S_{k, l})} in the jth iteration depends on the resulting scores {P(S_{i, j − 1}, S_{k, l − 1})} in the (j-1)th iteration. However, {P(S_{i, j}, S_{k, 0})} needs to be computed individually using Eq. (2).

The detailed algorithm for the starting point searching is illustrated in Algorithm 1. And Fig. 1 shows the process of starting point searching algorithm for the OOPS and ZOOPS models. MEME Suite [12] is an open source implementation of MEME algorithm. The method presented in this paper is based on MEME Suite (version 4.11.2).

Process of starting point searching algorithm

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol