K-mer-based genome-wide association study

Anna Both
Jiabin Huang
Minyue Qi
Christian Lausmann
Samira Weißelberg
Henning Büttner
Susanne Lezius
Antonio Virgilio Failla
Martin Christner
Marc Stegger
Thorsten Gehrke
Sharmin Baig
Mustafa Citak
Malik Alawi
Martin Aepfelbacher
Holger Rohde

All non-clonal isolates were selected for analysis, whereas only a single isolate was randomly selected from each infection sample. In total, 62 non-clonal isolates (non-clonal group) and 23 infection isolates (infection group) were analysed.

Contigs were split into overlapping 30-mers using Jellyfish (version 2.2.10, parameters used: -m 30 -s 150M -C) [123]. The k-mer sequences were saved in FASTA format and their abundances were compiled in a table.
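The counting step can be illustrated with a minimal pure-Python sketch. It is not the Jellyfish implementation; it only mimics what the parameters describe: -m 30 sets the k-mer length, and -C counts canonical k-mers, meaning a k-mer and its reverse complement are counted as one. The function names are hypothetical, and skipping windows that contain characters other than A, C, G and T is an assumption of this sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def count_canonical_kmers(contigs, k=30):
    """Count overlapping canonical k-mers across an iterable of contig strings."""
    counts = Counter()
    valid = set("ACGT")
    for contig in contigs:
        seq = contig.upper()
        # Slide a window of length k over the contig, one base at a time.
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption of this sketch: skip windows with ambiguous bases.
            if set(kmer) <= valid:
                counts[canonical(kmer)] += 1
    return counts
```

For example, `count_canonical_kmers(["ACGTT"], k=3)` merges ACG and CGT into a single canonical entry because they are reverse complements of each other.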

A 2 x 2 contingency table was generated for each k-mer. To test for significant differences between the two groups, Fisher’s exact test was applied [38]. Only k-mers occurring with significantly (p-value ≤ 0.001) different abundances between the two groups were considered for further analysis.
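As an illustration of the filtering step, the following sketch computes a two-sided Fisher's exact test for a single k-mer using only the standard library. The exact layout of the contingency table is not stated in the text; the sketch assumes rows are the two groups and columns are the numbers of isolates with and without the k-mer. The function names are hypothetical, and in practice a library routine such as `scipy.stats.fisher_exact` would normally be used instead.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Assumed layout: a, b = infection isolates with / without the k-mer;
    c, d = non-clonal isolates with / without the k-mer.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at most as likely as the observed one.
    # The small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def is_significant(a, b, c, d, alpha=0.001):
    """Apply the threshold from the text (p-value <= 0.001)."""
    return fisher_exact_two_sided(a, b, c, d) <= alpha
```

With the group sizes from this study, a hypothetical k-mer present in 20 of 23 infection isolates and 5 of 62 non-clonal isolates would be tested as `is_significant(20, 3, 5, 57)`.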

For each isolate, a binary vector indicating the presence of each k-mer was constructed. Based on these vectors, a random forest model was trained to classify whether an isolate belongs to the infection group or the non-clonal group. The Python package Scikit-learn [124] was used for this task. Finally, k-mers were sorted by their feature importance. Only the k-mers with a feature importance greater than zero were kept.
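The classification step might look roughly like the sketch below, which uses Scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The hyperparameters used in the study are not reported here, so the number of trees and the random seed are assumptions, as are the function names and the encoding of labels (1 for infection, 0 for non-clonal).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def presence_matrix(isolate_kmers, kmers):
    """Build one binary presence vector per isolate.

    isolate_kmers: list of sets, the k-mers found in each isolate.
    kmers: ordered list of the k-mers that passed the significance filter.
    """
    return np.array(
        [[int(k in found) for k in kmers] for found in isolate_kmers],
        dtype=np.uint8,
    )


def rank_kmers(presence, labels, kmers, n_trees=200, seed=0):
    """Train a random forest and return (k-mer, importance) pairs.

    Pairs are sorted by decreasing importance; k-mers with zero
    importance are dropped. n_trees and seed are assumed values.
    """
    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    model.fit(presence, labels)
    importance = model.feature_importances_
    order = np.argsort(importance)[::-1]  # indices from highest to lowest
    return [(kmers[i], float(importance[i])) for i in order if importance[i] > 0]
```

A k-mer that is absent from (or present in) every isolate can never be used for a split, so it receives an importance of exactly zero and is removed by the final filter.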

A pan-genome comprising all isolates was constructed using Roary 3.12.0 [106] from the annotations previously generated by Prokka [125]. The Python packages biopython and regex were then used to associate k-mers with the genomic loci they originated from.
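The mapping of k-mers back to loci can be sketched as a simple sequence search. The study used biopython and regex on the Roary and Prokka output; the sketch below is a simplified stand-in that uses only the standard-library `re` module and assumes the locus sequences have already been loaded into a dictionary keyed by locus tag. All names are hypothetical. Because canonical k-mers were counted, both strands are searched.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_kmer(kmer, loci):
    """Find every occurrence of a k-mer in a dict of locus sequences.

    loci: dict mapping a locus tag to its nucleotide sequence (assumed input).
    Returns a list of (locus_tag, 0-based start, strand) tuples.
    """
    queries = [("+", kmer)]
    rc = reverse_complement(kmer)
    if rc != kmer:  # a palindromic k-mer would otherwise be reported twice
        queries.append(("-", rc))

    hits = []
    for strand, query in queries:
        # A lookahead pattern reports overlapping matches as well.
        pattern = re.compile("(?=" + re.escape(query) + ")")
        for locus, seq in loci.items():
            for match in pattern.finditer(seq.upper()):
                hits.append((locus, match.start(), strand))
    return hits
```

A hit on the "-" strand means the reverse complement of the k-mer occurs in the locus sequence as given; the reported position refers to that sequence.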
