K-mer based genome wide association study

Anna Both; Jiabin Huang; Minyue Qi; Christian Lausmann; Samira Weißelberg; Henning Büttner; Susanne Lezius; Antonio Virgilio Failla; Martin Christner; Marc Stegger; Thorsten Gehrke; Sharmin Baig; Mustafa Citak; Malik Alawi; Martin Aepfelbacher; Holger Rohde

Concise Method

This method is extracted from research article: PLoS Pathog, Feb 2021

Distinct clonal lineages and within-host diversification shape invasive Staphylococcus epidermidis populations

DOI: 10.1371/journal.ppat.1009304

All non-clonal isolates were selected for analysis, whereas only a single isolate was randomly selected from each infection sample. In total, 62 non-clonal isolates (non-clonal group) and 23 infection isolates (infection group) were analysed.

Contigs were split into overlapping 30-mers using Jellyfish (version 2.2.10, parameters used: -m 30 -s 150M -C) [123]. The k-mer sequences were saved in FASTA format and their abundances were compiled in a table.
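To illustrate what the counting step produces, the following is a minimal pure-Python sketch of canonical k-mer counting, under the assumption that the -C flag reports each k-mer as the lexicographically smaller of the k-mer and its reverse complement. It is not a replacement for Jellyfish; the function names are hypothetical, and skipping windows that contain ambiguous bases is an assumption of this sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical_kmers(contig: str, k: int = 30) -> Counter:
    """Count canonical k-mers over all overlapping windows of one contig."""
    contig = contig.upper()
    counts = Counter()
    for i in range(len(contig) - k + 1):
        kmer = contig[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # assumption: windows with ambiguous bases (e.g. N) are skipped
        # Canonical form: the smaller of the k-mer and its reverse complement.
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


if __name__ == "__main__":
    # Toy example with k = 3 so the result is easy to check by hand.
    print(canonical_kmers("ACGTT", k=3))
```

Counting canonical forms means a k-mer and its reverse complement are treated as the same feature, so the strand on which a contig was assembled does not affect the abundance table.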

A 2 × 2 contingency table was generated for each k-mer. To test for significant differences between the two groups, Fisher’s exact test was applied [38]. Only k-mers whose abundances differed significantly (p-value ≤ 0.001) between the two groups were considered for further analysis.
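The per-k-mer test can be sketched as follows. This is a minimal example, assuming that each table counts the isolates with and without the k-mer in the infection and non-clonal groups; the exact table layout is not stated in the method. The two-sided p-value is computed from the hypergeometric distribution using only the standard library (scipy.stats.fisher_exact is a library alternative). The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical k-mer: present in 20 of 23 infection isolates
    # and in 10 of 62 non-clonal isolates.
    p = fisher_exact_two_sided(20, 3, 10, 52)
    print(p, p <= 0.001)
```

A k-mer would be retained for the next step only if its p-value does not exceed the 0.001 threshold.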

For each isolate, a binary vector indicating the presence of each k-mer was constructed. Based on these vectors, a random forest model was trained to classify whether an isolate belongs to the infection group or the non-clonal group. The Python package Scikit-learn [124] was used for this task. Finally, k-mers were sorted by their feature importance. Only the k-mers with a feature importance greater than zero were kept.
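The classification and ranking step might look like the sketch below, which uses Scikit-learn's RandomForestClassifier and its feature_importances_ attribute. The number of trees, the random seed, the function name, and the toy data are assumptions made for illustration; the hyperparameters actually used are not given in the method.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rank_kmers(presence, labels, kmers, n_trees=500, seed=0):
    """Fit a random forest and return (k-mer, importance) pairs with importance > 0."""
    X = np.asarray(presence, dtype=np.uint8)  # rows: isolates, columns: k-mers
    y = np.asarray(labels)                    # 1 = infection, 0 = non-clonal
    model = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    model.fit(X, y)
    # Sort k-mers by impurity-based feature importance, highest first.
    ranked = sorted(zip(kmers, model.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
    return [(kmer, float(imp)) for kmer, imp in ranked if imp > 0]


if __name__ == "__main__":
    # Toy data: 8 isolates, 4 hypothetical k-mers (the last one is never present).
    kmers = ["kmer_a", "kmer_b", "kmer_c", "kmer_const"]
    presence = [
        [1, 0, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0],  # infection
        [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0], [0, 1, 1, 0],  # non-clonal
    ]
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    for kmer, importance in rank_kmers(presence, labels, kmers):
        print(kmer, round(importance, 3))
```

A k-mer that never contributes to a split in any tree has an importance of exactly zero, which is why the filter on importance greater than zero removes uninformative features such as the constant column in the toy data.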

A pan-genome consisting of all isolates was constructed using Roary 3.12.0 [106] with results previously generated by Prokka [125]. The Python packages Biopython and regex were then used to associate k-mers with the genomic loci they originated from.
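One way to associate k-mers with loci is sketched below. To keep the example self-contained it uses the standard-library re module in place of Biopython and regex, and it assumes that the annotated loci are already available as a dictionary of locus identifier to nucleotide sequence (for instance, parsed from an annotation FASTA file). The function names and the toy sequences are hypothetical, and the authors' actual matching procedure may differ.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_kmers(kmers, loci):
    """Map each k-mer to a list of (locus_id, 0-based start, strand) hits."""
    hits = {kmer: [] for kmer in kmers}
    for kmer in kmers:
        queries = {"+": kmer, "-": reverse_complement(kmer)}
        if queries["+"] == queries["-"]:
            del queries["-"]  # palindromic k-mer: avoid reporting it twice
        for strand, query in queries.items():
            # A lookahead pattern also reports overlapping occurrences.
            pattern = re.compile(f"(?={re.escape(query)})")
            for locus_id, seq in loci.items():
                for match in pattern.finditer(seq.upper()):
                    hits[kmer].append((locus_id, match.start(), strand))
    return hits


if __name__ == "__main__":
    # Toy loci and a toy 5-mer; real input would use 30-mers.
    loci = {"geneA": "AAACGTTT", "geneB": "CCCAAACGG"}
    print(locate_kmers(["AAACG"], loci))
```

Searching both the k-mer and its reverse complement matters when k-mers were counted in canonical form, because the stored sequence may lie on the opposite strand of the annotated locus.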

This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
