2.2.1. Natural Vector Method (NV)
This protocol is extracted from research article:
iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins
Comput Math Methods Med, Jan 7, 2021; DOI: 10.1155/2021/6664362

The natural vector method (NV) was designed by Deng et al. [20] for evolutionary and phylogenetic analysis of groups of biological sequences. With this method, each protein sequence is mapped to a 60-dimensional numeric vector containing the occurrence counts, the mean positions, and the second normalized central moments of the twenty amino acids. The method is alignment-free and requires no parameters, and it has proven to be a powerful tool for virus classification, phylogeny, and protein prediction [21–23]. Its details are described as follows.

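First, suppose that each BLP (or non-BLP) sequence sample $P$ of length $L$ is written as

$$P = S_1 S_2 S_3 \cdots S_L,$$

where each residue $S_i$ belongs to the set of 20 amino acids, $S_i \in \{A, C, D, \cdots, W, Y\}$, for $i = 1, 2, 3, \cdots, L$. For each of the 20 amino acids $k$, we may define the indicator function

$$w_k(\cdot) : \{A, C, D, \cdots, W, Y\} \to \{0, 1\},$$

where $w_k(S_i) = 1$ if $S_i = k$, and $w_k(S_i) = 0$ otherwise.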

Second, the number of occurrences of amino acid $k$ in the protein sequence $P$, denoted $n_k$, is calculated as

$$n_k = \sum_{i=1}^{L} w_k(S_i).$$

Next, let $S_{[k][i]}$ be the distance from the first amino acid (regarded as the origin) to the $i$-th occurrence of amino acid $k$ in the protein sequence, $T_k$ be the total distance of amino acid $k$, and $\mu_k$ be the mean position of amino acid $k$. They are calculated as

$$T_k = \sum_{i=1}^{n_k} S_{[k][i]}, \qquad \mu_k = \frac{T_k}{n_k}.$$

Take the amino acid sequence MCRAACGECFR as an example. For amino acid A, $n_A = 2$, and the total distance is $T_A = 3 + 4 = 7$, since the distances from the first residue to the two A residues are 3 and 4, respectively. Then $\mu_A = T_A/n_A = 7/2$. Similarly, $T_C = 1 + 5 + 8 = 14$ with $n_C = 3$, so $\mu_C = T_C/n_C = 14/3$. The mean positions of the other amino acids are obtained in the same way.
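The worked example can be checked with a few lines of Python. This is only a minimal sketch: the function name `position_stats` is hypothetical, and it assumes that the distance of a residue is its 0-based index (the first residue is the origin at distance 0), which is what the numbers in the example imply.

```python
from collections import defaultdict

def position_stats(seq):
    """Return {residue: (n_k, T_k, mu_k)} for every residue present in seq.

    Assumption: the distance of a residue is its 0-based index, so the
    first residue (the origin) has distance 0.
    """
    distances = defaultdict(list)
    for i, residue in enumerate(seq):
        distances[residue].append(i)
    # n_k = number of occurrences, T_k = total distance, mu_k = T_k / n_k
    return {k: (len(d), sum(d), sum(d) / len(d)) for k, d in distances.items()}

stats = position_stats("MCRAACGECFR")
print(stats["A"])  # (2, 7, 3.5)
print(stats["C"])  # n_C = 3, T_C = 14, mu_C = 14/3
```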

Two protein sequences can differ in how each amino acid is distributed along the chain even when they share the same amino acid counts and mean positions. Therefore, information about the distribution is also included in the natural vector. The second-order normalized central moment $D_2^k$ is defined as

$$D_2^k = \sum_{i=1}^{n_k} \frac{\left(S_{[k][i]} - \mu_k\right)^2}{n_k L}.$$

The second normalized central moment is thus the variance of the distance distribution for each amino acid, divided by the sequence length $L$.
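As a rough illustration, the second normalized central moment can be computed as below. This sketch assumes the normalization by $n_k L$ used in the original natural vector definition (dividing by $n_k$ alone would give the plain variance), 0-based distances as in the MCRAACGECFR example, and a value of 0 for residues that do not occur; the function name `second_central_moment` is hypothetical.

```python
def second_central_moment(seq, k):
    """Normalized second central moment D_2^k of residue k in seq.

    Assumptions: distances are 0-based indices, the squared deviations
    are divided by n_k * L, and an absent residue yields 0.0.
    """
    L = len(seq)
    positions = [i for i, s in enumerate(seq) if s == k]
    n_k = len(positions)
    if n_k == 0:
        return 0.0
    mu_k = sum(positions) / n_k
    return sum((p - mu_k) ** 2 for p in positions) / (n_k * L)

print(second_central_moment("MCRAACGECFR", "A"))  # 0.5 / 22, about 0.0227
print(second_central_moment("MCRAACGECFR", "C"))  # 74 / 99, about 0.7475
```

For A, the distances 3 and 4 deviate from the mean 3.5 by 0.5 each, so the summed squared deviation is 0.5, which is then divided by $n_A L = 2 \times 11$.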

To describe a protein sequence sufficiently, the three groups of parameters (the count of each amino acid, the mean position of each amino acid, and the second normalized central moment describing its distance distribution) are concatenated to obtain the final natural vector. The 60-dimensional natural vector of a protein sequence $P$ is therefore defined as

$$\left(n_A, \cdots, n_Y, \; \mu_A, \cdots, \mu_Y, \; D_2^A, \cdots, D_2^Y\right)^{T},$$

where the symbol "T" is the transpose operator.
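Putting the three groups together, a 60-dimensional natural vector might be assembled as in the following sketch. Several choices here are assumptions rather than details confirmed by the text: the amino acids are ordered alphabetically by one-letter code, the components are grouped as counts, then mean positions, then second moments, distances are 0-based, the moment is normalized by $n_k L$, absent amino acids contribute zeros, and non-standard residue letters are ignored. The names `AMINO_ACIDS` and `natural_vector` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering: alphabetical one-letter codes

def natural_vector(seq):
    """Return a 60-element list: 20 counts, 20 mean positions, 20 second moments."""
    L = len(seq)
    counts, means, moments = [], [], []
    for k in AMINO_ACIDS:
        positions = [i for i, s in enumerate(seq) if s == k]  # 0-based distances
        n_k = len(positions)
        if n_k == 0:
            # Assumption: an amino acid that never occurs contributes zeros.
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(sum((p - mu_k) ** 2 for p in positions) / (n_k * L))
    return counts + means + moments

vec = natural_vector("MCRAACGECFR")
print(len(vec))          # 60
print(vec[0], vec[20])   # n_A = 2.0 and mu_A = 3.5
```

A plain list keeps the sketch dependency-free; in practice the vectors for many sequences would likely be stacked into an array before being passed to a classifier.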
