Sequence features are obtained by feature extraction, that is, by deriving numeric information from protein sequences. These features are the values from which the underlying model is learned, and feature extraction is often the most critical step in determining whether a method will ultimately succeed. Features were extracted from windows of protein sequences using different amino acid descriptors. Some of the chosen descriptors were proposed in previous studies of phosphorylation site prediction, where they were found to contribute varying amounts of information about the phosphosite. The descriptors implemented in this study are summarized as follows.
Shannon entropy (H) is known in information theory as a measure of the randomness and diversity of a set of objects distributed in a space. It was defined by Shannon as a unique function that represents the average amount of information for a set of objects according to their probabilities [34]. It has been widely used in bioinformatics to score residue conservation [35]. However, in this study, instead of using position-specific entropy, which is calculated with a position-specific scoring matrix (PSSM) [36], we used window-wise entropy, which is calculated from the probabilities of the individual amino acids in the window and yields one numeric feature. It can be calculated as
$$H = -\sum_{i} p_i \log_2 p_i$$
where $p_i$ is the probability of an amino acid i = (A, C, E, D, G, F, I, H, K, M, L, N, Q, P, S, R, T, W, V, Y) in the window, computed as the number of occurrences of amino acid i divided by the length of the window. The probability of any amino acid that does not occur in the window is taken to be zero, so it contributes nothing to the sum. Entropy ranges between zero, where only one type of residue is found in the entire window, and 3.17 ($\log_2 9$ for a 9-residue window), where every amino acid in the window occurs exactly once.
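The window-wise entropy can be sketched as follows. This is a minimal illustration, assuming base-2 logarithms (which is what the stated maximum of 3.17 for a 9-residue window implies); the function name `window_entropy` is hypothetical and not taken from the original study.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the amino acid composition of one window."""
    n = len(window)
    counts = Counter(window)
    # Amino acids absent from the window have probability zero and are
    # simply skipped, which matches the convention 0 * log(0) = 0.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For a window such as KAGVSPHED, in which all nine residues differ, the value equals log2 9; for a window made of a single repeated residue it is zero.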
The window-wise relative entropy (RE) of two distributions $p_i$ and $p_0$, also known as the Kullback-Leibler distance, is calculated as
$$RE = \sum_{i} p_i \log_2 \frac{p_i}{p_0}$$
where $p_0$ = 1/9 is the uniform distribution of the amino acid occurrence.
RE is always nonnegative and becomes zero if and only if $p_i$ = $p_0$. Like entropy, RE is represented by one feature for each window. We again assumed that the probability of any amino acid that does not occur in the window is zero. RE was used in previous studies to identify conserved positions [37, 38].
Information gain (IG) is computed by subtracting RE from entropy. It measures the transformation of information from the background (random) state to the state influenced by the class, that is, whether the sequence is positive or negative. IG is given by
$$IG = H - RE$$
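As a rough illustration, RE and IG can be computed from the same residue counts used for the entropy. The sketch below assumes base-2 logarithms and the uniform background $p_0$ = 1/9 stated in the text; the function names and the `p0` parameter are hypothetical.

```python
import math
from collections import Counter


def window_entropy(window: str) -> float:
    """Shannon entropy (bits) of the residue composition of the window."""
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in Counter(window).values())


def relative_entropy(window: str, p0: float = 1.0 / 9.0) -> float:
    """Kullback-Leibler distance between the window composition and a
    uniform background p0 (assumed 1/9, as stated in the text)."""
    n = len(window)
    # Residues absent from the window have probability zero and are skipped.
    return sum((c / n) * math.log2((c / n) / p0)
               for c in Counter(window).values())


def information_gain(window: str, p0: float = 1.0 / 9.0) -> float:
    """IG = H - RE for one window."""
    return window_entropy(window) - relative_entropy(window, p0)
```

With this background, a 9-residue window in which every residue is different has RE = 0, so its IG equals its entropy.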
The amino acids of a protein sequence can be either buried or exposed, depending on their position in the 3-dimensional structure of the protein. Buried residues usually do not undergo posttranslational modification because they are not expected to interact with the modifying enzymes. Phosphorylation sites are therefore expected to be exposed amino acids. Rvp-net [39], software for predicting accessible surface area (ASA), was used to extract ASA features from the benchmark protein sequences. ASA features were predicted before the sequences were divided into windows.
Overlapping properties (OP) capture the common physicochemical properties shared by the amino acids in the protein sequence [22, 40]. The amino acids were classified based on ten physicochemical properties: polar (NQSDECTKRHYW), positive (KHR), negative (DE), charged (KHRDE), hydrophobic (AGCTIVLKHFWYM), aliphatic (IVL), aromatic (FYWH), small (PNDTCAGSV), tiny (ASGC), and proline (P). An amino acid may fall into more than one group (i.e., the groups overlap). Each amino acid was encoded with a 10-bit code in which each bit represents one group, in the order listed above. A bit is set to 1 if the amino acid belongs to the corresponding group and to 0 if it does not. For example, histidine (H) is encoded as 1101101000, which indicates that it belongs to the polar, positive, charged, hydrophobic, and aromatic groups. The number of features extracted with this method is n × 10, where n is the window size [40]. For a sequence window of size 9, the number of features is 90.
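The OP encoding can be sketched directly from the ten groups listed above. This is a minimal illustration, assuming the bit order follows the order in which the groups are named in the text; the names `OP_GROUPS`, `op_encode_residue`, and `op_features` are hypothetical.

```python
from typing import List

# The ten groups, in the order given in the text (one bit per group).
OP_GROUPS = [
    ("polar", "NQSDECTKRHYW"),
    ("positive", "KHR"),
    ("negative", "DE"),
    ("charged", "KHRDE"),
    ("hydrophobic", "AGCTIVLKHFWYM"),
    ("aliphatic", "IVL"),
    ("aromatic", "FYWH"),
    ("small", "PNDTCAGSV"),
    ("tiny", "ASGC"),
    ("proline", "P"),
]


def op_encode_residue(aa: str) -> str:
    """10-bit membership code of a single amino acid."""
    return "".join("1" if aa in members else "0" for _, members in OP_GROUPS)


def op_features(window: str) -> List[int]:
    """Concatenate the 10-bit codes of all residues (n x 10 features)."""
    return [int(bit) for aa in window for bit in op_encode_residue(aa)]
```

Calling `op_encode_residue("H")` reproduces the histidine code 1101101000 given in the text, and a 9-residue window yields 90 binary features.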
The average cumulative hydrophobicity (ACH) has been used in previous studies as a protein descriptor to predict phosphorylation sites [22, 41]. ACH quantifies the tendency of the amino acids that surround the phosphorylation sites to interact with solvents. The Eisenberg hydrophobicity scales [42] have been used where
The number of ACH features depends on the size of the window. For a window of size 9, the ACH is computed by averaging the cumulative hydrophobicity indices of the amino acids around the putative phosphorylation site for subwindows of sizes 3, 5, 7, and 9, where S/T/Y is always at the center of the window. For example, to calculate ACH for the sequence KAGVSPHED, we first create the subwindows AGVSPHE, GVSPH, and VSP (the full window itself serves as the subwindow of size 9). Then we can calculate the feature for each subwindow as
$$ACH_n = \frac{1}{n} \sum_{i=1}^{n} P_i$$
where n is the subwindow size and $P_i$ is the hydrophobicity index of the amino acid at position i in the subwindow. For this example the number of features is four.
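The subwindow averaging can be sketched as below. This is an illustration only: it assumes an odd-length window with the candidate site at the center, and it takes the hydrophobicity scale as a dictionary argument rather than hard-coding the Eisenberg values, which should be taken from [42]. The function names are hypothetical.

```python
from typing import Dict, List, Sequence


def subwindows(window: str, sizes: Sequence[int] = (3, 5, 7, 9)) -> List[str]:
    """Centered subwindows of the given odd sizes (site at the center)."""
    center = len(window) // 2
    result = []
    for size in sizes:
        half = size // 2
        result.append(window[center - half:center + half + 1])
    return result


def ach_features(window: str, scale: Dict[str, float],
                 sizes: Sequence[int] = (3, 5, 7, 9)) -> List[float]:
    """Average hydrophobicity index over each centered subwindow."""
    return [sum(scale[aa] for aa in sub) / len(sub)
            for sub in subwindows(window, sizes)]
```

For KAGVSPHED, `subwindows` returns VSP, GVSPH, AGVSPHE, and the full window, so `ach_features` produces the four features described in the text.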
Sequence features (SF) [22] are another form of amino acid composition and have recently been used together with other feature types to predict phosphorylation sites. SF features are extracted by encoding each amino acid with a unique 20-bit code in which one position is 1 and all other positions are 0 (e.g., 00100000000000000000). The number of SF features depends on the window size. For instance, for a window of size 9, the number of features is 9 × 20 = 180.
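This 20-bit encoding is a standard one-hot encoding and can be sketched as follows. The text does not state which bit corresponds to which amino acid, so the alphabetical ordering used here is an assumption, and the names `AMINO_ACIDS`, `one_hot`, and `sf_features` are hypothetical.

```python
from typing import List

# Assumed bit order: one-letter codes in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(aa: str) -> List[int]:
    """20-bit code with a single 1 at the position of the amino acid."""
    bits = [0] * len(AMINO_ACIDS)
    bits[AMINO_ACIDS.index(aa)] = 1
    return bits


def sf_features(window: str) -> List[int]:
    """Concatenate the one-hot codes of all residues (n x 20 features)."""
    return [bit for aa in window for bit in one_hot(aa)]
```

A 9-residue window gives 180 binary features, exactly nine of which are set to 1.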
To extract the composition, transition, and distribution (CTD) features [43, 44], the 20 amino acids are first categorized into 3 groups according to one of seven physicochemical properties at a time. The seven amino acid properties are hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure, and solvent accessibility [44]. For each property, the amino acids are encoded as 1, 2, or 3. For example, the sequence MVKELRTA is encoded as 33113122 based on hydrophobicity.
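The encoding step can be illustrated for hydrophobicity. The three-group assignment below is the one commonly used for CTD descriptors (polar, neutral, hydrophobic); it is not spelled out in the text, so it should be treated as an assumption, although it does reproduce the MVKELRTA example. The names are hypothetical.

```python
# Assumed hydrophobicity grouping: 1 = polar, 2 = neutral, 3 = hydrophobic.
HYDROPHOBICITY_GROUPS = {
    "1": "RKEDQN",
    "2": "GASTPHY",
    "3": "CLVIMFW",
}

# Reverse lookup: amino acid -> group code.
AA_TO_CODE = {aa: code
              for code, members in HYDROPHOBICITY_GROUPS.items()
              for aa in members}


def ctd_encode(sequence: str) -> str:
    """Replace each residue by the code (1, 2, or 3) of its group."""
    return "".join(AA_TO_CODE[aa] for aa in sequence)
```

The same function applies to the other six properties once the corresponding group table is supplied.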
Composition is defined as the global percentage of each encoded group in a sequence for the property p, where p is any of the seven properties. There are 21 composition features (3 for each of the seven physicochemical properties). The composition is calculated as
$$C_r = \frac{n_r}{n} \times 100, \quad r = 1, 2, 3$$
where $n_r$ is the number of residues with group code r in the window and n is the number of amino acids in the window.
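For one property, the composition reduces to counting the codes in the encoded window. A minimal sketch, taking an already encoded string such as 33113122 as input (the function name is hypothetical):

```python
from typing import List


def ctd_composition(encoded: str) -> List[float]:
    """Percentage of residues carrying each group code (1, 2, 3)."""
    n = len(encoded)
    return [encoded.count(code) / n * 100 for code in "123"]
```

For the encoded example 33113122 this gives 37.5, 25.0, and 37.5 percent for groups 1, 2, and 3, respectively.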
Transition is defined as the percent frequency with which a code (r) is followed by another code (s) or vice versa. Since there are three possible codes, the possible transitions are (1, 2), (1, 3), and (2, 3). The number of features is 21 (3 for each of the seven physicochemical properties). The transition can be given as follows:
$$T_{rs} = \frac{n_{rs} + n_{sr}}{N - 1} \times 100$$
where $n_{rs}$ is the number of positions at which code r is immediately followed by code s, and N is the length of the window.
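The transition features can be sketched by scanning adjacent pairs of codes. The sketch assumes, as in the usual CTD definition, that a change from r to s and a change from s to r are counted together and that the count is divided by the N − 1 adjacent pairs; the function name is hypothetical.

```python
from typing import List


def ctd_transition(encoded: str) -> List[float]:
    """Percent frequency of the transitions (1,2), (1,3), and (2,3)."""
    pairs = list(zip(encoded, encoded[1:]))  # the N - 1 adjacent pairs
    features = []
    for r, s in (("1", "2"), ("1", "3"), ("2", "3")):
        # Count the change in either direction (r -> s or s -> r).
        hits = sum(1 for a, b in pairs if (a, b) in ((r, s), (s, r)))
        features.append(hits / len(pairs) * 100)
    return features
```

In 33113122 there are seven adjacent pairs, of which one is a (1, 2) transition, three are (1, 3) transitions, and none is a (2, 3) transition.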
Distribution describes how the residues of each encoded group (1, 2, and 3) are spread along the sequence: for a given property, it records the positions of the first residue of the group and of the residues at which 25%, 50%, 75%, and 100% of the group's occurrences are reached. The number of feature elements for the distribution is 105 (15 for each of the seven physicochemical properties). The residue position is calculated by
where D is 25%, 50%, 75%, or 100%. The distribution is then calculated by dividing R by the length of the sequence and multiplying by 100.
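One possible reading of the distribution features is sketched below. The rule used to pick the residue that marks a given percentage (the occurrence with rank floor(count × D), with a minimum rank of 1) and the use of zeros for a group that is absent from the window are assumptions made for this illustration; they are not stated in the text. The function name is hypothetical.

```python
from typing import List


def ctd_distribution(encoded: str) -> List[float]:
    """15 features: for each code, the relative position (in percent) of the
    first residue and of the residues marking 25/50/75/100% of the group."""
    n = len(encoded)
    features: List[float] = []
    for code in "123":
        # 1-based positions of all residues carrying this code.
        positions = [i + 1 for i, c in enumerate(encoded) if c == code]
        count = len(positions)
        if count == 0:
            features.extend([0.0] * 5)  # assumption: absent group -> zeros
            continue
        features.append(positions[0] / n * 100)
        for pct in (25, 50, 75, 100):
            rank = max(1, (count * pct) // 100)  # assumed rounding rule
            features.append(positions[rank - 1] / n * 100)
    return features
```

For 33113122, group 1 occupies positions 3, 4, and 6 of 8, which under these assumptions yields 37.5, 37.5, 37.5, 50.0, and 75.0.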
Sequence order coupling features are calculated using the Schneider-Wrede chemical distance matrix [45]. For a protein window of N amino acids, the sequence order effect [46, 47] can be approximately computed as
$$\tau_k = \sum_{i=1}^{N-k} (d_{i,i+k})^2, \quad k = 1, 2, \ldots, m$$
where $\tau_k$ is the kth rank sequence order coupling number (SOCN), m is the maximum lag, and $d_{i,i+k}$ is the chemical distance between the residues at positions i and i + k. SOCN has 60 feature elements.
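The coupling numbers can be sketched as a sum of squared distances between residues that are k positions apart. The Schneider-Wrede matrix is not reproduced here; instead the sketch takes the distance as a function argument, and the toy distance in the usage note is purely hypothetical. The function name `socn` is also an assumption.

```python
from typing import Callable, List


def socn(window: str, dist: Callable[[str, str], float],
         max_lag: int) -> List[float]:
    """Sequence order coupling numbers tau_1 .. tau_max_lag.

    tau_k is the sum of squared distances between residues i and i + k.
    max_lag must be smaller than the window length.
    """
    n = len(window)
    return [sum(dist(window[i], window[i + k]) ** 2 for i in range(n - k))
            for k in range(1, max_lag + 1)]
```

With a toy distance that is 0 for identical residues and 1 otherwise, the window ACAC has three differing neighbours at lag 1, none at lag 2, and one at lag 3. In practice `dist` would look up the entry of the chemical distance matrix [45].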
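The first 20 features of the quasi-sequence-order (QSO) descriptor [46, 47] are the frequencies of the amino acids in the window and are calculated by
$$X_i = \frac{f_i}{\sum_{j=1}^{20} f_j + w \sum_{k=1}^{m} \tau_k}$$
where i = 1, 2, …, 20, $f_i$ is the normalized frequency of amino acid i, w is a weighting factor (w = 0.1), and $\tau_k$ is the kth rank sequence order coupling number up to the maximum lag m.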
The features from 21 upward reflect the sequence order using four physicochemical properties (hydrophobicity, hydrophilicity, polarity, and side-chain volume) and the Schneider-Wrede chemical distance matrix [48]. These parameters are calculated by
$$X_k = \frac{w \, \tau_{k-20}}{\sum_{j=1}^{20} f_j + w \sum_{l=1}^{m} \tau_l}$$
where k = 21, 22, …, 30, w is the weight (w = 0.1), and $\tau_{k-20}$ is the (k − 20)th rank sequence order coupling number as shown above. QSO has 100 feature elements. After extracting the features, the feature vector for each window can be represented as
where the subscript numbers are the position indices of the feature (f) of the corresponding descriptor. The total number of features, based on 9-amino-acid window size, is 593.
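Returning to the quasi-sequence-order descriptor described above, its two parts (20 frequency terms followed by the weighted coupling terms) can be sketched with a shared normalizing denominator. This is an illustration under stated assumptions: the frequency is taken as count divided by window length, the amino acid order is alphabetical, and the distance is passed in as a function rather than read from the Schneider-Wrede matrix. All names are hypothetical.

```python
from typing import Callable, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 terms


def socn(window: str, dist: Callable[[str, str], float],
         max_lag: int) -> List[float]:
    """Sequence order coupling numbers tau_1 .. tau_max_lag."""
    n = len(window)
    return [sum(dist(window[i], window[i + k]) ** 2 for i in range(n - k))
            for k in range(1, max_lag + 1)]


def qso(window: str, dist: Callable[[str, str], float],
        max_lag: int, w: float = 0.1) -> List[float]:
    """20 frequency terms followed by max_lag sequence order terms."""
    freqs = [window.count(aa) / len(window) for aa in AMINO_ACIDS]
    taus = socn(window, dist, max_lag)
    denom = sum(freqs) + w * sum(taus)  # shared normalizer
    return [f / denom for f in freqs] + [w * t / denom for t in taus]
```

Because both parts share the same denominator, the returned values sum to 1, which is a convenient sanity check when swapping in a real distance matrix.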