A suffix tree based algorithm for the RSGM problem

RY Runmin Yang
DZ Daming Zhu
request Request a Protocol
ask Ask a question
Favorite

To solve the RSGM problem, we present a suffix tree based algorithm that extends the algorithm proposed by Deng et al. for the blocked pattern matching problem [21]. A suffix tree is used to represent the string S. We assume that each edge in the suffix tree is labeled with only one letter (integer residue mass) to simplify the algorithm description.

We first review the algorithm for the blocked pattern matching problem, which was proposed by Deng et al. [21]. A blocked pattern is represented as a spectrum graph in which all edges are in one path. Let G={V,E} be the graph representation of a blocked pattern P=p1,p2,…,pm, where V={v0,v1…,vm} and v0,v1,…,vm is the only path from v0 to vm in G. Each prefix p1,p2,…,pk of the blocked pattern P corresponds to a path v0,v1,…,vk. A text string over Σ that matches the prefix p1,p2,…,pk is called a prefix text string of vk. A prefix text string is identifiable if it is a substring of S. For example, when P=114,255,87, both 114,156,99 and 114,99,156 are prefix text string of P. When S=114,156,99,87, the string 114,156,99 is an identifiable prefix text string of P, but 114,99,156 is not. If a prefix text string is not identifiable, then all text strings with the prefix are not identifiable, making it not necessary to search these text strings in S. Using identifiable prefix text strings significantly improves in the speed of searching a blocked pattern against S represented as a suffix tree.

Let Bi be the set of nodes in the suffix tree corresponding to all identifiable prefix text strings of vi for 0≤im, where m is the length of the blocked pattern. Specifically, B0 contains only the root node of the suffix tree. The blocking pattern matching algorithm progressively finds the node sets B1,…,Bm in the suffix tree. The node set Bm contains the solution to the blocked pattern matching problem.

Unlike the blocked pattern matching problem with only one blocked pattern, the spectrum graph G in the RSGM problem contains many paths from the start node to the end node, each path corresponding to a blocked pattern. The number of all paths in a spectrum graph may be an exponential function of the number of nodes. So it is an inefficient approach to directly extract all paths in a spectrum graph and search each path against a suffix tree separately.

Next, we describe our new algorithm for the RSGM problem, which extends the blocked pattern matching algorithm. Let V={v0,v1,…vm} be the set of nodes in the increasing order of their corresponding masses, in which the start node s is v0 and the end node t is vm. A text string is a prefix text string of node vi if it matches a blocked pattern corresponding to a path from v0 to vi. Let Bi be the set of nodes in the suffix tree corresponding to all identifiable prefix text strings of vi, and C(e) be the set of all text strings whose masses match the labeled mass of e (Table 1). In practice, a precomputed table is used to quickly obtain C(e). Because the fragment masses in query spectra have measured errors, an error tolerance ε is allowed in generating the text strings in the table. Denote EiE the set of all edges whose tail nodes are vi. For each edge e=(vj,vi) in Ei, we search from each suffix tree node in Bj to find all the paths corresponding to a text string in C(e) and add the end nodes of these paths to Bi (Algorithm 1).

An example set of text strings matched a mass 270.14

In proteoform identification, there are 20 common types of amino acids. The scaling factor 100 is used for the discretization of the residue masses of the 20 amino acids. The alphabet consists of 19 integers because leucine and isoleucine have the same discretized mass 11308. There are a total of 12 text strings whose descritized masses are 27014. For brevity, all masses in the text strings are represented by their corresponding amino acids. For example, the text string 11404, 15610 is represented by NR

After the last set Bm is found, the RSGM problem is solved by reporting all the text strings corresponding to a suffix tree node in Bm and their positions in S, which are stored in the suffix tree. Because the string S is represented by a suffix tree, the space complexity and time complexity of the algorithm are both O(n), where n is the length of the string S.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A