Determination of penalty scores for evaluating individual barcodes

IY In Seok Yang
SB Sang Won Bae
BP BeumJin Park
SK Sangwoo Kim
request Request a Protocol
ask Ask a question
Favorite

To apply the barcode factors to the barcode set optimization, we used a penalty-based selection strategy. To determine the most appropriate penalty scores for GCC, HP, and SR (PGCC, PHP, and PSR, respectively), we obtained appropriate equations using the curve fitting method on the website MyCurveFit [31] based on distributions observed in random barcodes with l = 12. To obtain gcc distribution, we calculated mean barcode counts at each gcc value. A Gaussian bell curve with the formula f(x)=a×e(xb)2/2c2 was used to fit its distribution, where x indicates gcc. We set f(gcc) to be maximum if gcc was in the range from 40% to 60% and minimum at the points gcc = 0% or 100%. Then, we determined PGCC values with a maximum value of 106 at the points gcc = 0% or 100% and a minimum value of 0 in the range of gcc = 40% to 60% by inverting the distribution of f(gcc) values. For distributions of hp and sr, we adopted median barcode counts at each data point of hp and sr, respectively. When hp and sr were examined in a barcode sequence, we only considered them greater than or equal to 2, because there are many cases with hp = 1 and sr = 1 in a barcode. An exponential curve obtained with the formula f(x) = a + b × ecx was applied to their random distributions, where x represents hp or sr. In the equations, constant terms (a) were removed to obtain f(hp) or f(sr) values greater than or equal to 0. We determined PHP and PSR scores by subtracting f(hp) and f(sr) values from 106, respectively, with a modification to transform penalty scores into positive integers.

To generate penalty scores for HD (PHD), we implemented the concept of accumulation of mutations (herein, substitutions of a base at a nucleotide position) to represent distances or differences between barcode sequences. In a given barcode set with l = n, there exist barcode pairs with a relationship of hd values ranging from 1 to n. Since barcode sequences are sequenced with sample DNA or RNA fragments, unexpected mutations can be observed in the sequences. In result, barcodes with hd = 1, 2, or 3 are more sensitive to misclustering than others, sometimes leading to inaccurate interpretation in data analysis. Accordingly, we only dealt with PHD values for hd = 1, 2, and 3 in this study. Since mutations can occur during library preparation or a sequencing step, rates for two error types (those for polymerases and sequencing platforms) warrant consideration. The former is known to range from 1/106 to 1/105 [32] and the latter approximately 1/103 with the Illumina sequencing platform [29]. Of these, we selected the latter, because it is one thousand times higher than that of the former. Accordingly, if the probability that a mutation is detected in a sequence is m (= 1/103), the probabilities that two and three mutations simultaneously occur would be m2 (= 1/106) and m3 (= 1/109), respectively. In this context, we defined PHD values of barcodes with hd = 1, 2, and 3 as 106, 103, and 1, respectively, by dividing the probabilities by 1/109. In addition, we set PHD scores for barcodes with hd≥4 to 0, since the values were smaller than 1 and closer to 0 in the hd range (for example, 1/103 and 1/106 for those with hd = 4 and hd = 5, respectively). In addition, three look-up-tables containing precalculated hd values for short subsequences with l = 2, 3, and 4 are used to reduce computation time in HD calculation. For example, a barcode pair with l = 8 is composed of two pairs of subsequences with l = 4. Another pair with l = 9 is composed of three pairs of subsequences with l = 2, 3, and 4. Another pair l = 10 is composed of two pairs of subsequences with l = 4 and a pair of subsequences with l = 2. So, the total hd value of the pair can be obtained by summing hd values of subsequence pairs referred from look-up-tables. Note that longer subsequences are preferentially considered than shorter subsequences when referring to hd values from the look-up-tables.

To define penalty scores of CP, we employed maximum Watson-Crick base pairs between given barcodes rather than predicted free energies that has been widely used to predict DNA cross-hybridizations [33], because we had to consider computational time of free energies for all combinations of them. We also assumed that the strength of CP would increase dramatically if the cp between them became greater than a certain limit. Otherwise, it would become close or equal to its minimum value. We set the limit to 2/3 of l. The assumption aims to prevent inclusion of barcodes in relationships of full reverse complementary or close to it. For barcodes with l = 8, 10, and 12, the limits become 5.3, 7.5, and 9, respectively. The maximum value of PCP score was set to 108 for the barcodes in full complementary relationship. We also determined that the score will decrease by 102 when cp decreases by 1. Thus, PCP score was decreased to 1 at cp = 8. To reflect the nature of PCP, we chose an exponential model with the formula f(x) = a + b × ecx. In the equation, constant terms (a) were removed to obtain a f(x) value greater than or equal to 0.

Unlike penalty scores for GCC, HP, and SR, those for HD and CP are obtained by pairwise comparison between two barcodes and, thus, demand a greater number of calculations. To reduce calculation times for PHP and PCP for a barcode pair, we used look-up-tables containing pre-calculated hd values for two barcodes with l = 2, 3, and 4, respectively, as described above. Nonetheless, a total of N22N calculations is required to obtain them for all barcode pairs, where N represents the total number of barcodes.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A