Determination of penalty scores for evaluating individual barcodes

In Seok Yang; Sang Won Bae; BeumJin Park; Sangwoo Kim

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Determination of penalty scores for evaluating individual barcodes

IY In Seok Yang

SB Sang Won Bae

BP BeumJin Park

SK Sangwoo Kim

This method is extracted from research article: PLoS One, Feb 2021

Development of a program for in silico optimized selection of oligonucleotide-based molecular barcodes

DOI: 10.1371/journal.pone.0246354

Request a Protocol

Ask a question

Favorite

To apply the barcode factors to the barcode set optimization, we used a penalty-based selection strategy. To determine the most appropriate penalty scores for GCC, HP, and SR (P_GCC, P_HP, and P_SR, respectively), we obtained appropriate equations using the curve fitting method on the website MyCurveFit [31] based on distributions observed in random barcodes with l = 12. To obtain gcc distribution, we calculated mean barcode counts at each gcc value. A Gaussian bell curve with the formula $f (x) = a \times e^{- {(x - b)}^{2} / 2 c^{2}}$ was used to fit its distribution, where x indicates gcc. We set f(gcc) to be maximum if gcc was in the range from 40% to 60% and minimum at the points gcc = 0% or 100%. Then, we determined P_GCC values with a maximum value of 10⁶ at the points gcc = 0% or 100% and a minimum value of 0 in the range of gcc = 40% to 60% by inverting the distribution of f(gcc) values. For distributions of hp and sr, we adopted median barcode counts at each data point of hp and sr, respectively. When hp and sr were examined in a barcode sequence, we only considered them greater than or equal to 2, because there are many cases with hp = 1 and sr = 1 in a barcode. An exponential curve obtained with the formula f(x) = a + b × e^cx was applied to their random distributions, where x represents hp or sr. In the equations, constant terms (a) were removed to obtain f(hp) or f(sr) values greater than or equal to 0. We determined P_HP and P_SR scores by subtracting f(hp) and f(sr) values from 10⁶, respectively, with a modification to transform penalty scores into positive integers.

To generate penalty scores for HD (P_HD), we implemented the concept of accumulation of mutations (herein, substitutions of a base at a nucleotide position) to represent distances or differences between barcode sequences. In a given barcode set with l = n, there exist barcode pairs with a relationship of hd values ranging from 1 to n. Since barcode sequences are sequenced with sample DNA or RNA fragments, unexpected mutations can be observed in the sequences. In result, barcodes with hd = 1, 2, or 3 are more sensitive to misclustering than others, sometimes leading to inaccurate interpretation in data analysis. Accordingly, we only dealt with P_HD values for hd = 1, 2, and 3 in this study. Since mutations can occur during library preparation or a sequencing step, rates for two error types (those for polymerases and sequencing platforms) warrant consideration. The former is known to range from 1/10⁶ to 1/10⁵ [32] and the latter approximately 1/10³ with the Illumina sequencing platform [29]. Of these, we selected the latter, because it is one thousand times higher than that of the former. Accordingly, if the probability that a mutation is detected in a sequence is m (= 1/10³), the probabilities that two and three mutations simultaneously occur would be m² (= 1/10⁶) and m³ (= 1/10⁹), respectively. In this context, we defined P_HD values of barcodes with hd = 1, 2, and 3 as 10⁶, 10³, and 1, respectively, by dividing the probabilities by 1/10⁹. In addition, we set P_HD scores for barcodes with hd≥4 to 0, since the values were smaller than 1 and closer to 0 in the hd range (for example, 1/10³ and 1/10⁶ for those with hd = 4 and hd = 5, respectively). In addition, three look-up-tables containing precalculated hd values for short subsequences with l = 2, 3, and 4 are used to reduce computation time in HD calculation. For example, a barcode pair with l = 8 is composed of two pairs of subsequences with l = 4. Another pair with l = 9 is composed of three pairs of subsequences with l = 2, 3, and 4. Another pair l = 10 is composed of two pairs of subsequences with l = 4 and a pair of subsequences with l = 2. So, the total hd value of the pair can be obtained by summing hd values of subsequence pairs referred from look-up-tables. Note that longer subsequences are preferentially considered than shorter subsequences when referring to hd values from the look-up-tables.

To define penalty scores of CP, we employed maximum Watson-Crick base pairs between given barcodes rather than predicted free energies that has been widely used to predict DNA cross-hybridizations [33], because we had to consider computational time of free energies for all combinations of them. We also assumed that the strength of CP would increase dramatically if the cp between them became greater than a certain limit. Otherwise, it would become close or equal to its minimum value. We set the limit to 2/3 of l. The assumption aims to prevent inclusion of barcodes in relationships of full reverse complementary or close to it. For barcodes with l = 8, 10, and 12, the limits become 5.3, 7.5, and 9, respectively. The maximum value of P_CP score was set to 10⁸ for the barcodes in full complementary relationship. We also determined that the score will decrease by 10² when cp decreases by 1. Thus, P_CP score was decreased to 1 at cp = 8. To reflect the nature of P_CP, we chose an exponential model with the formula f(x) = a + b × e^cx. In the equation, constant terms (a) were removed to obtain a f(x) value greater than or equal to 0.

Unlike penalty scores for GCC, HP, and SR, those for HD and CP are obtained by pairwise comparison between two barcodes and, thus, demand a greater number of calculations. To reduce calculation times for P_HP and P_CP for a barcode pair, we used look-up-tables containing pre-calculated hd values for two barcodes with l = 2, 3, and 4, respectively, as described above. Nonetheless, a total of $\frac{N^{2}}{2} - N$ calculations is required to obtain them for all barcode pairs, where N represents the total number of barcodes.

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol