After the CSBVs are aligned, an identity vector (IV) is created for every CSBV. All IVs have the same length, and each value represents the percentage of identity of a region of length W in the alignment; that is, a window of length W is used to calculate the percentage of identity.
First we create a binary vector (VB) that marks the matches in the alignment. VB has the length of the alignment. Since VB includes gap positions, its length can differ from one CSBV to another. Using a window of length W, we can compute the percentage of identity at any point in VB. Because IVs from different CSBVs are going to be compared, identity values at alignment positions that correspond to a gap in sequence X are not stored. This way, the identity vectors from different CSBVs all have the same length, the ROI length.
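The construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `-` gap character, the centered window, and the truncation of the window at the ends of the alignment are all choices made for this example, and `w` is assumed to be odd.

```python
def match_vector(seq_x, seq_y, gap="-"):
    """Binary vector VB: 1 where the aligned symbols match, 0 otherwise (gaps count as 0)."""
    if len(seq_x) != len(seq_y):
        raise ValueError("aligned sequences must have the same length")
    return [1 if a == b and a != gap else 0 for a, b in zip(seq_x, seq_y)]


def windowed_identity(vb, w):
    """Percentage of identity in a window of length w centered on each position.

    Assumption: w is odd, and the window is truncated at the ends of VB.
    """
    half = w // 2
    out = []
    for i in range(len(vb)):
        window = vb[max(0, i - half): i + half + 1]
        out.append(100.0 * sum(window) / len(window))
    return out


def identity_vector(seq_x, seq_y, w, gap="-"):
    """Identity vector with the positions that are gaps in sequence X left out."""
    vb = match_vector(seq_x, seq_y, gap)
    ident = windowed_identity(vb, w)
    # Keep only alignment columns where sequence X has a residue, so the
    # result always has the (ungapped) length of sequence X.
    return [v for v, a in zip(ident, seq_x) if a != gap]


iv = identity_vector("AC-GT", "ACAGA", w=3)
print(len(iv))  # one value per non-gap position of sequence X
```

Because the gap columns of sequence X are dropped after the window is applied, the gaps still influence the identity values of neighbouring positions, but they do not change the length of the resulting vector.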
Low values of the parameter W produce a noisy identity vector that reflects high-frequency changes of identity. In contrast, high values of W smooth out the noise and produce a low-frequency signal. No single W value can be selected as the proper one, since the best choice may change depending on the CSBV involved. We may also be interested in changes that happen at different frequencies. Therefore, instead of choosing a fixed W value, which would capture changes at only one frequency, we build a vector containing all frequencies as follows:
where A_i is the weight of the identity vector at a certain frequency.
The identity vector at a certain frequency is calculated as follows:
In this model, N defines the maximum window used to compute the percentage of identity, and it also determines the start and end positions at which the values of the vector can be used. From 0 to 2N+1 and from ROI length − (2N+1) to ROI length the IV is incomplete. Therefore, N cannot be arbitrarily large; it should at least be smaller than OFFSET. In practice, we have observed that a value of 50 is enough to get good results.
Finally, since the identity vectors are going to be compared, they must be normalized.
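As an illustration of combining several window sizes, the sketch below computes a weighted sum of identity vectors, one per window size up to N. It rests on assumptions that are not confirmed by the text: the i-th window is taken to have length 2i+1, the combination is taken to be the sum of A_i times the i-th identity vector, the weights default to a uniform 1/N, and windows are truncated at the ends of the vector. The removal of gap positions is left out for brevity, and the function names are hypothetical.

```python
def windowed_identity(vb, half):
    """Percentage of identity in a centered window of length 2*half + 1.

    Assumption: the window is truncated at the ends of the vector, so the
    values near both ends are computed from fewer positions.
    """
    out = []
    for i in range(len(vb)):
        window = vb[max(0, i - half): i + half + 1]
        out.append(100.0 * sum(window) / len(window))
    return out


def multiscale_identity(vb, n, weights=None):
    """Weighted sum of the identity vectors for window sizes i = 1..n.

    Assumption: combined[k] = sum_i A_i * IV_i[k], with uniform weights
    A_i = 1/n when none are given.
    """
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("one weight per window size is required")
    combined = [0.0] * len(vb)
    for i, a_i in enumerate(weights, start=1):
        iv_i = windowed_identity(vb, half=i)
        for k, value in enumerate(iv_i):
            combined[k] += a_i * value
    return combined


vb = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1]
print(multiscale_identity(vb, n=3))
```

With this layout, small windows contribute the rapid local changes and large windows contribute the smooth trend, so the weights control how much each scale influences the final vector.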