SAKE parameter selection

ML Miika Leinonen
LS Leena Salmela
AE Alfonso Esposito
AE Alfonso Esposito
AE Alfonso Esposito
request Request a Protocol
ask Ask a question
Favorite

SAKE has a few parameters the user can adjust to affect the precision and recall of the reported k-mers. Also, the modified BFCounter has new parameters to determine the shape of the strobemers. These parameters are listed in Table 1. Before the experiments shown in the Results section were performed, we needed to find what values to use for these parameters.

Our goal was not to fully optimize the parameters since this would have taken unnecessarily long. Instead, we were satisfied with using a reasonable set of parameter values if tweaking them would not significantly improve the precision and recall of the produced k-mers. For some parameters, we could reason out the values that would work well. For the rest that could not be as easily reasoned, we made initial guesses and performed experiments to see if adjusting them would make a meaningful difference. Of course, this does not necessarily lead to an optimal set of parameter values. Still, the amount of time it would have taken to find the best parameter values for a marginal increase in result quality did not seem worthwhile, especially since the “best” parameter values are dependent on the used data set. Also, it is difficult to determine what are the optimal parameter values, since there is often a tradeoff between precision and recall.

The experiments to determine parameter values were performed with multiple runs on simulated 40x 6% error rate E. coli reads. We reasoned out the values for parameters n and l, but for the rest of the parameters we made the following initial guesses: w = 11, a = 3, g = 3, z = 0.4, b = 0.8, and c = 3. The value of parameter v is not fixed a priori, because it was used to adjust the size of the area covered by the strobemer to fit each individual k-mer length according to Eq (1) in the Output k-mer splitting section.

First, we decided which values to use for the strobemer parameters n, l, v and w. This choice is guided by the inequality (n − 1) × vw + l + 1 ≥ k defined in Eq (1), which ensures that the sequences that are used for consensus generation are of length at least k. Fixing some of the parameter values will affect what values we can choose for the other parameters. We decided that the length of the strobemer (the sum of the strobe lengths nl, not including the gaps between them) should not be dependent on the length of the searched k-mers. The reasoning behind this was that the strobemer characters are the only ones that are fixed in consensus sequence generation, and we wanted the probability that they contain an error to be the same in all experiments regardless of the k-mer length. Thus, we used the same value for n and l in all experiments.

When deciding the value for l, we should consider that we want it to be likely that a minimizer should be unique in its window. For parameter n we also need to take into account that the number of the minimizers should be at least two (and preferably more) for the inequality (and the strobemer itself) to make sense. On the other hand, we do not want the strobemer length nl to be too long, because the longer the strobemer is, the more likely it is that the strobemer contains an error. Sahlin [12] ran experiments where the total strobemer length was 30. We tried using a similar setup with parameters n = 3 and l = 10 which leads to the same strobemer length. Unfortunately, this did not work out well enough and the recall rate was not as high as we would have hoped. We assume that in SAKE the effects of errors in the read are multiplied because we are matching strobemers to other strobemers in the grouping step. On the other hand, Sahlin [12] only matched strobemers to a reference genome, so the longer strobemers are tolerated better with similar error rates. We deduced the strobemers of length 30 made it too probable for them to contain an error making it difficult to identify matching strobemers. We had to adjust the parameters so that the total strobemer length was lower. We decided to keep the number of minimizers at three but lower the minimizer length to seven, leading to the strobemers being 21 characters long. With this setup, we observed that the recall increased to a reasonable range and we were satisfied with the chosen values.

Next, the size of the minimizer window w should be chosen so that even if the areas between the windows have insertions and deletions causing the position of the minimizer window to shift from its correct position, it is still likely the correct minimizer can be found within the minimizer window. Thus, the window should contain a good number of minimizer candidates so the majority of them stay in the intended window even if a few insertions or deletions shift the position of the window. On the other hand, we do not want the window to be too big to ensure there is a unique minimizer in the window (which is also affected by the minimizer length l). More importantly, if the window is too long, the resulting strobemers are less likely to be unique to their specific position in the genome causing us to group sequences from different areas of the genome making the consensus generation more difficult. We wanted the window length w to be at least l long so there was no character in the window that was shared by all the minimizer candidates. We then chose a few test values w = 9, 11, 13, 15 around our initial guess w = 11 to see if the guessed value should be adjusted. In these experiments, we used the already chosen n = 3 and l = 7 parameters, and the initial guess values for the other parameters (v was chosen based on Eq (1)). We searched for 47-mers and 97-mers and measured the precision and recall using the k-mers found in the reference genome. The results can be seen in Fig 5.

This experiment was used to evaluate the initial parameter guess w = 11 for the minimizer window length. We performed this experiment on simulated 40x 6% error rate E. coli data. Note that the y-axis starts from 0.8.

We observed that the effects of the window size between these values were very small. The precision had a tiny decrease as the window size grew larger. Nevertheless, both precision and recall were good with all the tested values, and we chose to keep our initial guess value w = 11 for the rest of the experiments.

Finally, for the strobemer parameters, we are left to set the value for the distance between minimizer windows v. We used this parameter to fit the different values of k. And as mentioned earlier, the value of v is not fixed and it is used to scale the length of the sequence covered by the strobemer to fit the different k-mer lengths.

After this, we were ready to decide the non-strobemer parameters. As previously mentioned, our initial guesses for them were b = 0.8, z = 0.4, and a, g, c = 3. First, we decided to check if the bundling threshold b = 0.8 was reasonable. We used the already decided strobemer parameters n = 3, l = 7, and w = 11, and the initial guesses for the other parameters. We tried using values 0.5, 0.6, 0.7, 0.8, and 0.9 for the bundling threshold. The results can be seen in Fig 6. We observed that with the test data set, the difference in precision and recall between these values was minuscule. The experiments did not definitively indicate which value was the best, so we decided to stick with our initial guess b = 0.8 for the bundling threshold. We thought this would allow enough flexibility for the sequences to be bundled together even with some errors, without bundling sequences that are obviously not from the same genomic region.

This experiment was used to evaluate the initial parameter guess b = 0.8 for the bundling threshold. We performed this experiment on simulated 40x 6% error rate E. coli data. Note that the y-axis starts from 0.8.

After b was fixed, we moved on to find a value for the relative bundle support threshold z. This value determines if the bundle size is big enough for the consensus sequence to be accepted. If the number of sequences in the bundle divided by the number of all sequences in the PO-MSA graph is at least z, the consensus sequence is considered to have enough support. Our initial guess for this parameter was 0.4, but we also experimented with values 0.2, 0.3, 0.5, and 0.6. One thing that should be noted about this parameter is that its value determines what is the maximum number of consensus sequences we can produce from a single PO-MSA graph. For example, if one sequence has enough support when z = 0.6, then the next consensus sequence can have at most 1−0.6 = 0.4 support because the sequences of the first bundle are removed from the graph. We ran similar tests as before using the chosen strobemer parameters, b = 0.8, and initial guesses a, g, c = 3 to search for 47-mers and 97-mers. The result of this experiment can be seen in Fig 7. Again, we saw very little difference between the tested values, which implied that most of the time a single graph produced at most one consensus sequence. Still, we chose to keep the guessed value z = 0.4, which allows one graph to produce two consensus sequences while requiring them to have a reasonably large supporting bundle.

This experiment was used to evaluate the initial parameter guess z = 0.4 for the relative bundle support threshold. We performed this experiment on simulated 40x 6% error rate E. coli data. Note that the y-axis starts from 0.8.

Finally, we were left to find what values to choose for parameters a, g, and c. We decided to group these parameters together because they are dependent on each other. For example, if c = 5, the characters in a consensus sequence are required to have support from five bundled sequences. If this is the case, it does not make sense to set parameter g to be less than five, because if the bundle size was allowed to be four, then no character in the consensus sequence would have enough support. Similarly, it would not make sense to set a to be less than five, because the number of strobemer occurrences determines how many sequences appear in the PO-MSA graph, which directly sets a maximum value for the bundle size. Because of these parameter dependencies, we decided to treat them as a single parameter when choosing their values.

As our initial guess, we had set all these parameters to be 3. Lower values could not bring better results, so we ran experiments where we set these parameters to be 3, 5, 7, 9, and 11 and used the other already decided parameter values to extract 47-mers and 97-mers. The results of these experiments can be seen in Fig 8. The higher values 7, 9, and 11 did not work well with the data set, so we had to decide between the smaller values. With a, g, c = 3 the recall was better than with a, g, c = 5, while the precision was lower. Looking at these results the best overall value for a, g, c would probably be 4 or 5. Thus, in the Results section, we decided to show experiment results with three different values 3, 4, and 5.

This experiment was used to evaluate the initial parameter guesses a, g, c = 3 for the minimum strobemer abundance a, absolute bundle support threshold g, and character support threshold c. We performed this experiment on simulated 40x 6% error rate E. coli data.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A