All non-clonal isolates were selected for analysis, whereas only a single isolate was randomly selected from each infection sample. In total, 62 non-clonal isolates (non-clonal group) and 23 infection isolates (infection group) were analysed.
Contigs were split into overlapping 30-mers using Jellyfish (version 2.2.10, parameters used: -m 30 -s 150M -C) [123]. The k-mer sequences were saved in FASTA format and their abundances were compiled in a table.
A 2 × 2 contingency table was generated for each k-mer. To test for significant differences between the two groups, Fisher's exact test was applied [38]. Only k-mers whose abundances differed significantly (p-value ≤ 0.001) between the two groups were considered for further analysis.
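The test on a single k-mer can be sketched as follows. This is a pure-Python illustration of a two-sided Fisher's exact test, not the authors' code. The layout of the contingency table (rows: infection group and non-clonal group; columns: isolates with and without the k-mer) is an assumption, as are the function names `fisher_exact_two_sided` and `is_significant`.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Assumed layout: a, b = infection isolates with / without the k-mer;
    c, d = non-clonal isolates with / without the k-mer.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = 0.0
    for x in range(lowest, highest + 1):
        p_x = prob(x)
        if p_x <= p_observed * (1 + 1e-7):
            p_value += p_x
    return min(1.0, p_value)


def is_significant(a, b, c, d, alpha=0.001):
    """Apply the p-value threshold of 0.001 stated in the text."""
    return fisher_exact_two_sided(a, b, c, d) <= alpha
```

For example, a k-mer found in 20 of 23 infection isolates and 5 of 62 non-clonal isolates would be tested with `is_significant(20, 3, 5, 57)`.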
For each isolate, a binary vector indicating the presence of each k-mer was constructed. Based on these vectors, a random forest model was trained to classify whether an isolate belongs to the infection group or the non-clonal group. The Python package Scikit-learn [124] was used for this task. Finally, k-mers were sorted by their feature importance, and only k-mers with a feature importance greater than zero were kept.
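The data handling around the classifier can be sketched as below. The random forest itself is not reproduced here: the sketch only builds the binary presence vectors and then ranks k-mers by importance values supplied by the caller. The assumption is that these values would come from something like the `feature_importances_` attribute of a fitted Scikit-learn `RandomForestClassifier`; the function names and the example data are hypothetical.

```python
def presence_vectors(isolate_kmers, selected_kmers):
    """Build one binary vector per isolate.

    isolate_kmers: dict mapping isolate name -> set of k-mers found in it.
    selected_kmers: ordered list of k-mers that define the vector positions.
    """
    return {
        name: [1 if kmer in kmers else 0 for kmer in selected_kmers]
        for name, kmers in isolate_kmers.items()
    }


def rank_by_importance(selected_kmers, importances):
    """Keep k-mers with importance > 0, sorted from most to least important."""
    kept = [
        (kmer, importance)
        for kmer, importance in zip(selected_kmers, importances)
        if importance > 0
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

The vectors returned by `presence_vectors` form the rows of the feature matrix, with group membership as the label; `rank_by_importance` then applies the sorting and the greater-than-zero filter described above.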
A pan-genome comprising all isolates was constructed using Roary 3.12.0 [106] from annotations previously generated by Prokka [125]. The Python packages Biopython and regex were then used to associate k-mers with the genomic loci they originated from.
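A simplified sketch of the k-mer-to-locus association is shown below. It uses plain substring search instead of Biopython and regex, and assumes the locus sequences are already available as a mapping from locus tag to nucleotide sequence. Both strands are searched because the k-mers were counted in canonical form. The function names and the input structure are hypothetical.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_kmers(kmers, loci):
    """Map each k-mer to the locus tags whose sequence contains it.

    kmers: iterable of k-mer strings.
    loci: dict mapping locus tag -> nucleotide sequence (hypothetical input).
    """
    hits = {}
    for kmer in kmers:
        rc = reverse_complement(kmer)
        # A k-mer matches a locus if it occurs on either strand.
        hits[kmer] = [
            tag for tag, seq in loci.items() if kmer in seq or rc in seq
        ]
    return hits
```

K-mers that map to no locus end up with an empty list, which in this sketch would include k-mers originating from sequence outside the supplied loci.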