Comparative genome analysis using PathoSPOT-compare

Ana Berbel Caban; Theodore R. Pak; Ajay Obla; Amy C. Dupper; Kieran I. Chacko; Lindsey Fox; Alexandra Mills; Brianne Ciferri; Irina Oussenko; Colleen Beckford; Marilyn Chung; Robert Sebra; Melissa Smith; Sarah Conolly; Gopi Patel; Andrew Kasarskis; Mitchell J. Sullivan; Deena R. Altman; Harm van Bakel



This method is extracted from research article: Genome Med, Nov 2020

PathoSPOT genomic epidemiology reveals under-the-radar nosocomial outbreaks

DOI: 10.1186/s13073-020-00798-3


We developed the PathoSPOT-compare pipeline [11] to perform comparative phylogenomic analysis of annotated genome assemblies for the specific purpose of outbreak detection. The pipeline is implemented as a Rakefile (the Ruby analogue of a Makefile) that resolves dependencies and executes all subtasks needed to produce the requested outputs. PathoSPOT-compare takes two inputs: FASTA-formatted genome assemblies, and a relational database (SQLite or MySQL). The database holds metadata for each assembly (including collection time, location, collection method, organism, and patient ID) and patient admission/discharge/transfer (ADT) history, which is used for spatiotemporal analysis.

Genetic distances for outbreak detection are ultimately calculated by counting single nucleotide variant (SNV) differences within core-genome alignments. However, there is a trade-off: the more diverse the assemblies being aligned, the smaller the core genome becomes, because more subsequences fail to align across all assemblies. We therefore implemented a hybrid approach. Pairwise distances between all assemblies are first estimated using Mash [12], a k-mer-based hashing method that approximates average nucleotide identity (ANI). Mash distances are then used to perform greedy single-linkage hierarchical pre-clustering, with pre-clusters capped at a pre-specified diameter and size. The default parameters, which were also used for this study, are a maximum Mash pre-cluster diameter of 0.02 (approximating 98% ANI among all included genomes) and at most 100 genomes per pre-cluster.
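The capped pre-clustering step can be illustrated with a minimal Python sketch. This is one plausible reading of "greedy single-linkage with a diameter and size cap", not the pipeline's actual code: the function name `precluster`, the dictionary layout of the distance input, and the rule of skipping any merge that would violate a cap are all assumptions made for illustration. Only the default values (0.02 and 100) come from the text above.

```python
from itertools import combinations


def precluster(names, dist, max_diameter=0.02, max_size=100):
    """Greedily merge genomes into pre-clusters (illustrative sketch).

    names: list of genome identifiers.
    dist:  dict mapping frozenset({a, b}) to a Mash-like distance
           (hypothetical input layout, chosen for this example).
    """
    cluster_of = {n: i for i, n in enumerate(names)}
    members = {i: [n] for i, n in enumerate(names)}

    # Visit pairs from closest to most distant, as in single-linkage clustering.
    pairs = sorted(combinations(names, 2), key=lambda p: dist[frozenset(p)])
    for a, b in pairs:
        if dist[frozenset((a, b))] > max_diameter:
            break  # every remaining pair is even further apart
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue  # already in the same pre-cluster
        if len(members[ca]) + len(members[cb]) > max_size:
            continue  # merge would exceed the size cap
        # Each cluster already respects the diameter cap, so only
        # cross-cluster pairs need to be checked before merging.
        if any(dist[frozenset((x, y))] > max_diameter
               for x in members[ca] for y in members[cb]):
            continue
        members[ca].extend(members[cb])
        for n in members[cb]:
            cluster_of[n] = ca
        del members[cb]

    return sorted(sorted(m) for m in members.values())


if __name__ == "__main__":
    genomes = ["A", "B", "C", "D"]
    d = {frozenset(p): 0.2 for p in combinations(genomes, 2)}
    d[frozenset(("A", "B"))] = 0.001
    d[frozenset(("A", "C"))] = 0.015
    d[frozenset(("B", "C"))] = 0.03
    print(precluster(genomes, d))
```

In this toy input, C is within 0.02 of A but 0.03 from B, so joining it to the A/B pre-cluster would push the diameter past the cap and it stays on its own. Plain single-linkage without the cap would have chained all three together.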

Rapid core-genome alignments are then created for each pre-cluster using parsnp [13], which is tailored for intraspecific genome analysis and is therefore well-suited for outbreak analysis. The variant call format (VCF) files produced for each pre-cluster are converted to NumPy arrays (NPZ files) so that PathoSPOT-visualize, the downstream visualization web application, can quickly load and subset the variant data and display called variants alongside phylogenies. The primary output consumed by PathoSPOT-visualize is a JSON file containing a matrix of pairwise SNV distances for all genomes (with inter-pre-cluster distances left unspecified) and a maximum-likelihood phylogeny for each pre-cluster. Additional optional pipeline tasks export patient location data (as TSV files) and epidemiological data on positive and negative culture results (as JSON files). When available, both are automatically layered onto the comparative genomic analyses within PathoSPOT-visualize.
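To make the shape of the distance output concrete, the sketch below counts SNV differences between aligned core-genome sequences within each pre-cluster and leaves distances between pre-clusters unspecified (`None`, which serializes to JSON `null`). The function names, the input layout, the choice to ignore positions that are not A/C/G/T in both sequences, and the JSON keys are assumptions made for this example; they are not the pipeline's actual file format.

```python
import json
from itertools import combinations


def snv_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has a gap or ambiguous base are
    skipped (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x != y and x in "ACGT" and y in "ACGT"
    )


def distance_matrix(preclusters):
    """Build one matrix over all genomes from per-pre-cluster alignments.

    preclusters: list of dicts mapping genome name to aligned sequence
                 (hypothetical input layout).
    Returns (names, matrix); cells spanning two pre-clusters stay None.
    """
    names = [name for pc in preclusters for name in pc]
    index = {name: i for i, name in enumerate(names)}
    matrix = [[None] * len(names) for _ in names]
    for pc in preclusters:
        for name in pc:
            matrix[index[name]][index[name]] = 0
        for a, b in combinations(pc, 2):
            d = snv_distance(pc[a], pc[b])
            matrix[index[a]][index[b]] = d
            matrix[index[b]][index[a]] = d
    return names, matrix


if __name__ == "__main__":
    clusters = [
        {"g1": "ACGTAC", "g2": "ACGTAA", "g3": "TCGTAA"},
        {"g4": "GGCC", "g5": "GG-A"},
    ]
    names, matrix = distance_matrix(clusters)
    # Hypothetical JSON layout, for illustration only.
    print(json.dumps({"genomes": names, "distances": matrix}))
```

Leaving cross-pre-cluster cells empty reflects that each core-genome alignment covers only one pre-cluster, so SNV counts between genomes from different alignments are not comparable.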

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
