FSOT

Zeyu Yin
Yu Chen
Yajie Hao
Sanjeevi Pandiyan
Jinsong Shao
Li Wang

FSOT (Fragment Slicing based on Optimal Transport) is a fragment slicing method for proteins and compounds based on the optimal transport algorithm. As with BPE merge rules, the original compound and protein sequences are first split into individual tokens. For simplicity, BPE generates the fragment candidate lists, using the ZINC Clean Leads dataset for compound token generation and the PDBbind dataset for protein token generation. All tokens are initialized with probabilities for the optimal transport algorithm. At each timestep, the vocabulary with maximum entropy is obtained from the transport matrix. Because the transport constraints are relaxed, illegal transport cases occur, so tokens with frequencies below 0.001 are removed.
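The final filtering step of this pipeline can be sketched as follows (the function name, threshold handling, and token probabilities below are illustrative, not taken from the original implementation):

```python
def prune_vocab(token_probs, threshold=0.001):
    """Remove tokens whose probability has fallen below the threshold.

    The relaxed constraints of the transport problem can leave spurious,
    near-zero probability mass on some tokens; those are dropped here.
    """
    return {tok: p for tok, p in token_probs.items() if p >= threshold}

# Hypothetical token probabilities after one optimal-transport step.
probs = {"CC": 0.4, "O": 0.3, "c1": 0.2995, "Zr": 0.0005}
print(prune_vocab(probs))  # -> {'CC': 0.4, 'O': 0.3, 'c1': 0.2995}
```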

Formally, the Marginal Utility of Vocabularization (MUV) is the negative derivative of entropy with respect to vocabulary size. For simplicity, we use a smaller vocabulary to estimate MUV in the implementation. Specifically, MUV is calculated as
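The equation body appears to have been lost in extraction; a reconstruction consistent with the surrounding definitions (and with the MUV formulation of VOLT, on which this protocol builds) is:

```latex
M_{v(k+m)} = -\frac{H_{v(k+m)} - H_{v(k)}}{m}
```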

where v(k) and v(k+m) are two vocabularies with k and k + m tokens, respectively. H_v represents the corpus entropy under vocabulary v, defined as the sum of token entropies. To avoid the effect of token length, we normalize the entropy by the average token length, and the final entropy is defined as:
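The entropy formula itself is missing here; from the definitions in the surrounding sentences it can be reconstructed as:

```latex
H_v = -\frac{1}{l_v} \sum_{i \in v} P(i) \log P(i)
```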

where P(i) is the relative frequency of token i in the training corpus and l_v is the average length of the tokens in vocabulary v.
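As a concrete illustration, the normalized entropy can be computed from raw token counts (the function name and the toy counts below are our own):

```python
import math

def corpus_entropy(token_counts):
    """Normalized corpus entropy H_v for a vocabulary.

    token_counts: dict mapping token string -> frequency in the corpus.
    Computes the Shannon entropy of the relative token frequencies,
    then divides by the average token length l_v to remove length effects.
    """
    total = sum(token_counts.values())
    avg_len = sum(len(tok) for tok in token_counts) / len(token_counts)
    h = -sum((c / total) * math.log(c / total) for c in token_counts.values())
    return h / avg_len

# Toy vocabulary over a SMILES-like alphabet (hypothetical counts).
counts = {"C": 40, "CC": 25, "O": 20, "N": 15}
print(round(corpus_entropy(counts), 4))  # -> 1.0556
```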

Let S = {i, 2·i, …, (t−1)·i, …} be a sequence of vocabulary-size upper bounds, where each timestep t corresponds to the set of vocabularies whose size is at most S[t]. For any vocabulary, its MUV score can be calculated against a vocabulary from the previous timestep. With the sequence S, the goal of finding the optimal vocabulary v(t) with the highest MUV can be formulated as:
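The formula is missing from the page; a reconstruction consistent with the MUV definition above (maximizing MUV over timesteps and over vocabularies of adjacent timesteps) is:

```latex
\operatorname*{argmax}_{t}\; \max_{v(t-1) \in V_{S[t-1]},\; v(t) \in V_{S[t]}} \; -\frac{H_{v(t)} - H_{v(t-1)}}{i}
```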

where V_{S[t−1]} and V_{S[t]} are the two sets containing all vocabularies with size upper bounds S[t−1] and S[t], respectively. Due to the exponential search space, we propose to optimize its lower bound:
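The lower-bound formula is also missing; since max_{a,b}(H_a − H_b) ≥ max_a H_a − max_b H_b, a reconstruction consistent with the two-step procedure described below is:

```latex
\operatorname*{argmax}_{t}\; \frac{1}{i}\left[\max_{v(t-1) \in V_{S[t-1]}} H_{v(t-1)} - \max_{v(t) \in V_{S[t]}} H_{v(t)}\right]
```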

where i is the size difference between the timestep-(t−1) vocabulary and the timestep-t vocabulary; MUV uses this size difference as the denominator. Based on this equation, the whole solution splits into two steps: 1) searching for the vocabulary with the highest entropy at each timestep t; 2) enumerating all timesteps and outputting the vocabulary corresponding to the timestep that satisfies Equation 4.
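The second step can be sketched as follows; the per-timestep maximum entropies would come from step 1, and the values below are made up for illustration:

```python
def select_timestep(entropies, step):
    """Step 2 of the search: pick the timestep with the highest MUV lower bound.

    entropies: list where entropies[t] is the maximum entropy found among
               vocabularies of size at most S[t]; `step` is the size
               increment i between consecutive timesteps.
    Returns the timestep t maximizing (entropies[t-1] - entropies[t]) / step.
    """
    best_t, best_muv = None, float("-inf")
    for t in range(1, len(entropies)):
        muv = (entropies[t - 1] - entropies[t]) / step
        if muv > best_muv:
            best_t, best_muv = t, muv
    return best_t, best_muv

# Hypothetical entropies: entropy typically shrinks as the vocabulary grows.
H = [1.20, 1.05, 0.98, 0.80, 0.78]
t, muv = select_timestep(H, step=1000)
print(t)  # -> 3
```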

The first step of our approach is to search for the vocabulary with the highest entropy in V_{S[t]}. Formally, the goal is to find a vocabulary v(t) such that the entropy is maximized:
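The equation is missing; from the entropy definition given earlier it can be reconstructed as:

```latex
v(t) = \operatorname*{argmax}_{v \in V_{S[t]}} H_v = \operatorname*{argmax}_{v \in V_{S[t]}} -\frac{1}{l_v} \sum_{i \in v} P(i) \log P(i)
```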

Directly searching the discrete set of vocabularies V_{S[t]} for the highest-entropy member is difficult, so the objective function in Equation 6 is rewritten in terms of the transport matrix and solved with the optimal transport algorithm.

After the vocabulary is generated, VOLT encodes text with a greedy strategy, similar to BPE. To encode a compound, it first breaks the compound down into character-level tokens. If the concatenation of two consecutive tokens is in the vocabulary, they are merged into one token. This process repeats until no more tokens can be merged. Out-of-vocabulary tokens are split into smaller tokens.
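A minimal sketch of this greedy encoding (the vocabulary and the SMILES-like input below are hypothetical; a real FSOT vocabulary comes from the optimal transport step above):

```python
def encode(sequence, vocab):
    """Greedily merge adjacent tokens whose concatenation is in the vocabulary.

    Starts from character-level tokens and repeatedly merges the first
    adjacent pair found in the vocabulary until no merge applies.
    Characters that never merge remain as single-character tokens,
    which is how out-of-vocabulary material falls back to smaller tokens.
    """
    tokens = list(sequence)
    merged = True
    while merged:
        merged = False
        for i in range(len(tokens) - 1):
            pair = tokens[i] + tokens[i + 1]
            if pair in vocab:
                tokens[i:i + 2] = [pair]
                merged = True
                break
    return tokens

# Hypothetical vocabulary over a SMILES fragment.
vocab = {"CC", "CCO", "c1", "cc"}
print(encode("CCOc1cc", vocab))  # -> ['CCO', 'c1', 'cc']
```

Note that, unlike classical BPE, this sketch merges pairs left to right rather than by learned merge priority; it matches the greedy description in the text.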
