2.2.1. Natural Vector Method (NV)

Dan Zhang, Hua-Dong Chen, Hasan Zulfiqar, Shi-Shi Yuan, Qin-Lai Huang, Zhao-Yue Zhang, Ke-Jun Deng

This protocol is extracted from research article:

iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins

**Comput Math Methods Med**, Jan 7, 2021; DOI: 10.1155/2021/6664362

Procedure

The natural vector (NV) method was designed by Deng et al. [20] for evolutionary and phylogenetic analysis of groups of biological sequences. It maps each protein sequence to a 60-dimensional numeric vector containing, for each of the twenty amino acids, its number of occurrences, its mean position, and its second-order normalized central moment. The method is alignment-free and requires no parameters, and it has proven to be a powerful tool for virus classification, phylogeny, and protein prediction [21–23]. The details are as follows.

First, suppose that each BLP (or non-BLP) sequence sample *P* of length *L* is written as

$$P = S_1 S_2 S_3 \cdots S_L,$$

where each residue *S*_{i} ∈ {*A*, *C*, *D*, ⋯, *W*, *Y*}, the set of 20 amino acids, for *i* = 1, 2, 3, ⋯, *L*. For each of the 20 amino acids *k*, we define the indicator function

$$w_k(\cdot): \{A, C, D, \cdots, W, Y\} \to \{0, 1\},$$

where *w*_{k}(*S*_{i}) = 1 if *S*_{i} = *k*, and *w*_{k}(*S*_{i}) = 0 otherwise.

Second, the number of occurrences of amino acid *k* in the protein sequence *P*, denoted *n*_{k}, is

$$n_k = \sum_{i=1}^{L} w_k(S_i).$$
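The counting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the names `AMINO_ACIDS`, `indicator`, and `count_amino_acids` are hypothetical, and the alphabetical one-letter ordering of the 20 amino acids is an assumption.

```python
# The 20 standard amino acids in one-letter code (alphabetical order is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def indicator(k: str, residue: str) -> int:
    """w_k(S_i): 1 if the residue equals amino acid k, otherwise 0."""
    return 1 if residue == k else 0


def count_amino_acids(sequence: str) -> dict[str, int]:
    """n_k for every amino acid k: the sum of w_k(S_i) over all positions i."""
    return {k: sum(indicator(k, s) for s in sequence) for k in AMINO_ACIDS}


counts = count_amino_acids("MCRAACGECFR")
print(counts["A"], counts["C"], counts["W"])  # 2 3 0
```

Amino acids that do not occur in the sequence simply receive a count of zero.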

Next, let *S*_{[k][i]} be the distance from the first amino acid (regarded as the origin) to the *i*-th occurrence of amino acid *k* in the protein sequence, *T*_{k} be the total distance for amino acid *k*, and *μ*_{k} be the mean position of amino acid *k*. They are calculated as follows:

$$T_k = \sum_{i=1}^{n_k} S_{[k][i]}, \qquad \mu_k = \frac{T_k}{n_k}.$$

Take the amino acid sequence MCRAACGECFR as an example. For amino acid *A*, *n*_{A} = 2 and the total distance is *T*_{A} = 3 + 4 = 7, since the distances from the first residue to the two *A*s are 3 and 4, respectively. Then *μ*_{A} = *T*_{A}/*n*_{A} = 7/2. Similarly, *n*_{C} = 3, *T*_{C} = 1 + 5 + 8 = 14, and *μ*_{C} = *T*_{C}/*n*_{C} = 14/3. The mean positions of the other amino acids are obtained in the same way.
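The worked example can be checked with a short Python sketch. It assumes, as the example does, that distances are measured from the first residue, so the first residue has distance 0. The function names are hypothetical, and returning a mean of 0 for an amino acid that is absent from the sequence is an assumption made here to avoid division by zero.

```python
from fractions import Fraction


def distances(sequence: str, k: str) -> list[int]:
    """Distances from the first residue (distance 0) to each occurrence of k."""
    return [i for i, s in enumerate(sequence) if s == k]


def total_and_mean(sequence: str, k: str) -> tuple[int, int, Fraction]:
    """Return (n_k, T_k, mu_k) for amino acid k."""
    d = distances(sequence, k)
    n_k = len(d)
    t_k = sum(d)
    # Assumption: mu_k is set to 0 when amino acid k does not occur.
    mu_k = Fraction(t_k, n_k) if n_k else Fraction(0)
    return n_k, t_k, mu_k


print(total_and_mean("MCRAACGECFR", "A"))  # (2, 7, Fraction(7, 2))
print(total_and_mean("MCRAACGECFR", "C"))  # (3, 14, Fraction(14, 3))
```

Exact fractions are used here only so that the output matches the hand calculation; floating-point values would work equally well in practice.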

Two protein sequences can differ in how each amino acid is distributed along the chain even if they share the same amino acid counts and mean positions. Information about this distribution is therefore also included in the natural vector, through the second-order normalized central moment *D*_{2}^{k}, defined as

$$D_2^k = \sum_{i=1}^{n_k} \frac{\left(S_{[k][i]} - \mu_k\right)^2}{n_k \, L}.$$

Up to the factor 1/*L*, the second normalized central moment is the variance of the distance distribution for each amino acid.
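As a rough illustration, the second-order moment can be computed for the example sequence MCRAACGECFR. The sketch assumes the normalization by *n*_{k} × *L* from the natural vector definition and distances measured from the first residue (distance 0); the function name `second_moment` and the value 0 for absent amino acids are assumptions of this sketch.

```python
from fractions import Fraction


def second_moment(sequence: str, k: str) -> Fraction:
    """D_2^k: sum of squared deviations from mu_k, divided by n_k * L (assumed normalization)."""
    length = len(sequence)
    d = [i for i, s in enumerate(sequence) if s == k]
    n_k = len(d)
    if n_k == 0:
        # Assumption: the moment is set to 0 when amino acid k does not occur.
        return Fraction(0)
    mu_k = Fraction(sum(d), n_k)
    return sum((x - mu_k) ** 2 for x in d) / (n_k * length)


print(second_moment("MCRAACGECFR", "A"))  # 1/44
print(second_moment("MCRAACGECFR", "C"))  # 74/99
```

For *A*, the two distances 3 and 4 each deviate from the mean 7/2 by 1/2, so the sum of squares is 1/2 and dividing by 2 × 11 gives 1/44.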

To describe a protein sequence fully, the three groups of parameters (the number of each amino acid, the mean position of each amino acid, and the second-order normalized central moment of its distance distribution) are concatenated to form the final natural vector. The 60-dimensional natural vector of a protein sequence *P* is thus defined as

$$\left(n_A, \mu_A, D_2^A, \cdots, n_Y, \mu_Y, D_2^Y\right)^{T},$$

where the symbol "*T*" is the transpose operator.
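Putting the three groups together, a 60-dimensional vector could be assembled as in the sketch below. This is not the authors' code. The function name `natural_vector`, the alphabetical amino acid order, the per-amino-acid ordering of the entries (count, mean, moment), the normalization by *n*_{k} × *L*, and the zeros used for absent amino acids are all assumptions of this illustration.

```python
# The 20 standard amino acids in one-letter code (ordering is an assumption).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def natural_vector(sequence: str) -> list[float]:
    """Return [n_A, mu_A, D2_A, ..., n_Y, mu_Y, D2_Y] for a protein sequence."""
    length = len(sequence)
    vector: list[float] = []
    for k in AMINO_ACIDS:
        # Distances from the first residue (distance 0) to each occurrence of k.
        d = [i for i, s in enumerate(sequence) if s == k]
        n_k = len(d)
        if n_k == 0:
            # Assumption: absent amino acids contribute three zeros.
            vector.extend([0.0, 0.0, 0.0])
            continue
        mu_k = sum(d) / n_k
        d2_k = sum((x - mu_k) ** 2 for x in d) / (n_k * length)
        vector.extend([float(n_k), mu_k, d2_k])
    return vector


nv = natural_vector("MCRAACGECFR")
print(len(nv))  # 60
print(nv[:2])   # [2.0, 3.5]
```

Because the vector has a fixed length regardless of sequence length, it can be passed directly to a classifier as a feature vector.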

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
