A suffix tree based algorithm for the RSGM problem

Runmin Yang; Daming Zhu

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

A suffix tree based algorithm for the RSGM problem

RY Runmin Yang

DZ Daming Zhu

This method is extracted from research article: BMC Genomics, Sep 2018

A graph-based filtering method for top-down mass spectral identification

DOI: 10.1186/s12864-018-5026-x

Request a Protocol

Ask a question

Favorite

To solve the RSGM problem, we present a suffix tree based algorithm that extends the algorithm proposed by Deng et al. for the blocked pattern matching problem [21]. A suffix tree is used to represent the string S. We assume that each edge in the suffix tree is labeled with only one letter (integer residue mass) to simplify the algorithm description.

We first review the algorithm for the blocked pattern matching problem, which was proposed by Deng et al. [21]. A blocked pattern is represented as a spectrum graph in which all edges are in one path. Let G={V,E} be the graph representation of a blocked pattern P=p₁,p₂,…,p_m, where V={v₀,v₁…,v_m} and v₀,v₁,…,v_m is the only path from v₀ to v_m in G. Each prefix p₁,p₂,…,p_k of the blocked pattern P corresponds to a path v₀,v₁,…,v_k. A text string over Σ that matches the prefix p₁,p₂,…,p_k is called a prefix text string of v_k. A prefix text string is identifiable if it is a substring of S. For example, when P=114,255,87, both 114,156,99 and 114,99,156 are prefix text string of P. When S=114,156,99,87, the string 114,156,99 is an identifiable prefix text string of P, but 114,99,156 is not. If a prefix text string is not identifiable, then all text strings with the prefix are not identifiable, making it not necessary to search these text strings in S. Using identifiable prefix text strings significantly improves in the speed of searching a blocked pattern against S represented as a suffix tree.

Let B_i be the set of nodes in the suffix tree corresponding to all identifiable prefix text strings of v_i for 0≤i≤m, where m is the length of the blocked pattern. Specifically, B₀ contains only the root node of the suffix tree. The blocking pattern matching algorithm progressively finds the node sets B₁,…,B_m in the suffix tree. The node set B_m contains the solution to the blocked pattern matching problem.

Unlike the blocked pattern matching problem with only one blocked pattern, the spectrum graph G in the RSGM problem contains many paths from the start node to the end node, each path corresponding to a blocked pattern. The number of all paths in a spectrum graph may be an exponential function of the number of nodes. So it is an inefficient approach to directly extract all paths in a spectrum graph and search each path against a suffix tree separately.

Next, we describe our new algorithm for the RSGM problem, which extends the blocked pattern matching algorithm. Let V={v₀,v₁,…v_m} be the set of nodes in the increasing order of their corresponding masses, in which the start node s is v₀ and the end node t is v_m. A text string is a prefix text string of node v_i if it matches a blocked pattern corresponding to a path from v₀ to v_i. Let B_i be the set of nodes in the suffix tree corresponding to all identifiable prefix text strings of v_i, and C(e) be the set of all text strings whose masses match the labeled mass of e (Table 1). In practice, a precomputed table is used to quickly obtain C(e). Because the fragment masses in query spectra have measured errors, an error tolerance ε is allowed in generating the text strings in the table. Denote E_i⊆E the set of all edges whose tail nodes are v_i. For each edge e=(v_j,v_i) in E_i, we search from each suffix tree node in B_j to find all the paths corresponding to a text string in C(e) and add the end nodes of these paths to B_i (Algorithm 1).

An example set of text strings matched a mass 270.14

In proteoform identification, there are 20 common types of amino acids. The scaling factor 100 is used for the discretization of the residue masses of the 20 amino acids. The alphabet consists of 19 integers because leucine and isoleucine have the same discretized mass 11308. There are a total of 12 text strings whose descritized masses are 27014. For brevity, all masses in the text strings are represented by their corresponding amino acids. For example, the text string 11404, 15610 is represented by NR

After the last set B_m is found, the RSGM problem is solved by reporting all the text strings corresponding to a suffix tree node in B_m and their positions in S, which are stored in the suffix tree. Because the string S is represented by a suffix tree, the space complexity and time complexity of the algorithm are both O(n), where n is the length of the string S.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol