The ARSI score was determined according to the following scheme. A given gene, transcript, or genetic region (UTR, intron, CDS, etc.) P can be described as a sequence of nucleotides S; the measure is based on the tendency of substrings of S to appear in other genetic elements, i.e., in a reference set G. Computing the ARSI(G, S) score of a specified sequence S given a reference set of genomic elements G is done in two steps (see Fig. 1b): 1) For each position i in the sequence S, find the longest substring $S_{ij}$ that starts at that position and appears in at least one of the sequences of the reference set G. 2) Let |S| denote the length of a sequence S; the ARSI of S is the mean length of all the substrings $S_{ij}$, i.e., $\mathrm{ARSI}(G, S) = \frac{1}{|S|} \sum_{i=1}^{|S|} |S_{ij}|$.
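The two steps above can be sketched in Python as follows. This is a minimal, unoptimized illustration and not the implementation used in the study: the function names `longest_match_length` and `arsi` are hypothetical, the reference set is assumed to be a plain list of strings, and the search uses repeated substring tests rather than an index structure such as a suffix array. It relies on the fact that every prefix of a matching substring also matches, so the longest match length at each position can be found by binary search.

```python
def longest_match_length(seq, i, reference):
    """Length of the longest substring of seq starting at index i
    that occurs in at least one sequence of the reference set."""
    lo, hi = 0, len(seq) - i  # a match of length lo is always known to exist
    while lo < hi:
        mid = (lo + hi + 1) // 2
        candidate = seq[i:i + mid]
        if any(candidate in ref for ref in reference):
            lo = mid        # a match of length mid exists, try longer
        else:
            hi = mid - 1    # no match of length mid, so none longer either
    return lo


def arsi(seq, reference):
    """Mean, over all start positions of seq, of the longest match length."""
    if not seq:
        raise ValueError("seq must be non-empty")
    total = sum(longest_match_length(seq, i, reference) for i in range(len(seq)))
    return total / len(seq)


# Toy example (made-up sequences, for illustration only).
# Longest matches per position: "ACG" (3), "CG" (2), "GT" (2), "T" (1).
print(arsi("ACGT", ["ACG", "GTT"]))  # 2.0
```

In this sketch, if S itself were included in the reference list, every position would match through to the end of S, so the caller would presumably leave it out; that choice is an assumption here, not something the text specifies.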
Please note that the ARSI measure is based on the reference genome of a given organism, and is therefore not expected to be affected by the sequencing errors and biases that appear in Next Generation Sequencing (NGS) experiments. Specifically, in this study the error rate is very low for the analyzed organisms (less than 1 in 1,000). As these errors are distributed relatively uniformly, their effect on the ARSI score is negligible: for example, in E. coli the Spearman correlation between the ARSI scores and those obtained from a simulation with a uniform error rate of 1:1,000 is higher than 0.99 ($p < 5 \times 10^{-323}$) for all 100 such randomizations that were performed.
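A robustness check of this kind could be set up roughly as below. The sketch is illustrative only: the text does not specify the error model, so uniform single-nucleotide substitutions are assumed, and the helper names `mutate`, `average_ranks`, and `spearman` are hypothetical. The Spearman correlation is computed as the Pearson correlation of average ranks; in practice a library routine such as `scipy.stats.spearmanr` would normally be used and would also report the p-value.

```python
import math
import random


def mutate(seq, rate, rng):
    """Substitute each nucleotide with probability `rate` by a different one
    (assumed error model: uniform, substitutions only)."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in order[pos:end + 1]:
            ranks[k] = mean_rank
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Usage outline: perturb every sequence at a 1:1,000 rate, rescore it,
# and correlate the original scores with the perturbed ones.
rng = random.Random(0)  # fixed seed so the perturbation is reproducible
noisy = mutate("ACGT" * 250, 0.001, rng)
```

Repeating the perturbation with 100 different seeds and recording the correlation each time would mirror the 100 randomizations described above.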