Calculating identity vectors

JA Arjona-Medina; O Trelles

This method is extracted from research article: BMC Genomics, Oct 2016

Refining borders of genome-rearrangements including repetitions

DOI: 10.1186/s12864-016-3069-4

After the CSB_Vs are aligned, an identity vector (I_V) is created for every CSB_V. All I_Vs have the same length, and each one represents the percentage of identity of a region of length W in the alignment. A sliding window of length W is used to calculate that percentage.

First, we create a binary vector (V_B) that represents matches in the alignment. V_B has the same length as the alignment. Because V_B takes gaps into account, its length can differ from one CSB_V to another. Using a window of length W, we can compute the percentage of identity at any point in V_B. Because I_Vs from different CSB_Vs are going to be compared, identity values at alignment positions that correspond to a gap in sequence X are not stored. This way, all identity vectors from different CSB_Vs have the same length, ROI_length.
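The construction above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the `-` gap character, the centred window (with `w` assumed odd), the truncation of the window at the alignment ends, and the treatment of gap columns as mismatches are all assumptions made for the example.

```python
def binary_match_vector(aligned_x, aligned_y, gap="-"):
    """Return V_B: 1 where the aligned sequences match, 0 otherwise.

    Assumption: a column containing a gap counts as a mismatch (0).
    """
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have the same length")
    return [
        1 if a == b and a != gap else 0
        for a, b in zip(aligned_x, aligned_y)
    ]


def identity_vector(aligned_x, aligned_y, w, gap="-"):
    """Percentage of identity in a window of w columns centred on each column.

    Values are stored only for columns where sequence X is not a gap, so the
    result has one entry per residue of X.
    """
    v_b = binary_match_vector(aligned_x, aligned_y, gap)
    half = w // 2  # assumption: w is odd, window is centred
    out = []
    for pos, base_x in enumerate(aligned_x):
        if base_x == gap:
            continue  # gap in X: identity value is not stored
        lo = max(0, pos - half)             # truncate window at the start
        hi = min(len(v_b), pos + half + 1)  # truncate window at the end
        window = v_b[lo:hi]
        out.append(100.0 * sum(window) / len(window))
    return out


aligned_x = "ACG-TA"
aligned_y = "ACGTTC"
iv = identity_vector(aligned_x, aligned_y, w=3)
print(len(iv))  # one value per non-gap position in X
```

Because gap columns in X are skipped, two alignments of the same region of X against different sequences yield vectors of equal length even when the alignments themselves differ in length.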

Low values of W produce a noisy identity vector that reflects high-frequency changes in identity. Conversely, high values of W smooth out the noise and produce a low-frequency signal. No single W value can be selected in advance, because the appropriate value may depend on the CSB_V involved. We may also be interested in changes that happen at different frequencies. Therefore, instead of choosing a fixed W value, which would capture changes at only one frequency, we build a vector containing all frequencies as follows:

where A_i is the weight of the identity vector at a certain frequency.

The identity vector at a certain frequency is calculated as follows:

In this model, N defines the maximum window used to compute the percentage of identity, and it also determines the start and end positions between which the values of the vector can be used. From 0 to 2N+1 and from ROI_length−(2N+1) to ROI_length, the I_V is incomplete. Therefore, N cannot be arbitrarily large; at a minimum, it should be smaller than OFFSET. In practice, we have observed that a value of N = 50 is enough to get good results.
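The idea of combining identity vectors computed at several window sizes can be sketched as below. This is only one plausible reading of the description and should not be taken as the authors' exact formulation: the weighted sum over half-widths 1..N, the window of 2i+1 positions at scale i, the uniform default weights, and the function names are all assumptions. With this simplified window definition the unusable margin is N positions on each side, which does not reproduce the 2N+1 margin stated in the text.

```python
def window_identity(v_b, half_width):
    """Fraction of matches in a centred window of 2*half_width + 1 positions.

    Positions too close to either end, where the window does not fit,
    are left as None.
    """
    n = len(v_b)
    size = 2 * half_width + 1
    out = [None] * n
    for j in range(half_width, n - half_width):
        out[j] = sum(v_b[j - half_width : j + half_width + 1]) / size
    return out


def multiscale_identity(v_b, n_max, weights=None):
    """Weighted sum of window identities for half-widths 1..n_max.

    weights[i - 1] plays the role of A_i. Only positions where the widest
    window fits are filled; the rest stay None.
    """
    if weights is None:
        weights = [1.0 / n_max] * n_max  # assumption: uniform weights
    if len(weights) != n_max:
        raise ValueError("one weight per window size is required")
    per_scale = [window_identity(v_b, i) for i in range(1, n_max + 1)]
    combined = [None] * len(v_b)
    for j in range(n_max, len(v_b) - n_max):
        combined[j] = sum(a * scale[j] for a, scale in zip(weights, per_scale))
    return combined


v_b = [1, 1, 1, 0, 1, 1, 1]
combined = multiscale_identity(v_b, n_max=2)
print(combined)  # the first and last two entries are None (incomplete windows)
```

The `None` margins make explicit why N is bounded: the larger the maximum window, the fewer positions near the ends of the vector carry a usable value.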

Finally, since the identity vectors are going to be compared, they must be normalized.
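The text does not state which normalization is applied. As an illustration only, the sketch below uses min-max scaling to the range [0, 1], which is one common choice and an assumption here; the function name is hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1.

    Assumption: a constant vector is mapped to all zeros.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


normalized = min_max_normalize([50.0, 75.0, 100.0])
print(normalized)
```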

Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
