We developed the PathoSPOT-compare pipeline [11] to perform comparative phylogenomic analysis of annotated genome assemblies for the specific purpose of outbreak detection. The pipeline is implemented as a Rakefile (the Ruby analogue of a Makefile, executed with the Rake build tool) that resolves dependencies and executes all subtasks needed to produce the desired outputs. PathoSPOT-compare takes FASTA-formatted genome assemblies as input, along with a relational database (SQLite or MySQL) containing metadata for each assembly (including collection time, location, collection method, organism, and patient ID) and patient admission/discharge/transfer (ADT) history, which is used for spatiotemporal analysis.
Genetic distances for outbreak detection are ultimately calculated by counting single nucleotide variant (SNV) differences within core-genome alignments; however, there is a trade-off between aligning increasingly diverse assemblies and a diminishing core-genome size (as more subsequences will fail to align across all assemblies). Therefore, we implemented a hybrid approach, wherein pairwise distances between all assemblies are first estimated using Mash [12], which uses a k-mer-based hashing approach that approximates average nucleotide identity (ANI). Mash distances are used to perform greedy single-linkage hierarchical pre-clustering, with pre-clusters capped at a pre-specified diameter and size. The default parameters, which are also the parameters used for this study, are a maximum Mash pre-cluster diameter of 0.02 (approximating 98% ANI among all included genomes) and at most 100 genomes per pre-cluster.
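The pre-clustering step can be illustrated with a minimal Python sketch. It is not the pipeline's own code: the function name `greedy_precluster`, the dictionary-based distance input, and the exact merge rule (visit pairs in order of increasing distance, and merge two clusters only if the result stays within the diameter and size caps) are assumptions made for illustration. Only the default caps of 0.02 and 100 come from the text above.

```python
from itertools import combinations


def greedy_precluster(names, dist, max_diameter=0.02, max_size=100):
    """Greedily merge genomes into pre-clusters (illustrative sketch).

    names: list of genome identifiers.
    dist:  dict mapping frozenset({a, b}) -> Mash distance (hypothetical layout).
    Pairs are visited from closest to farthest, as in single-linkage
    clustering, but a merge is rejected if it would exceed either cap.
    """
    cluster_of = {name: i for i, name in enumerate(names)}
    members = {i: [name] for i, name in enumerate(names)}

    pairs = sorted(combinations(names, 2), key=lambda p: dist[frozenset(p)])
    for a, b in pairs:
        if dist[frozenset((a, b))] > max_diameter:
            break  # all remaining pairs are farther apart than the cap
        ca, cb = cluster_of[a], cluster_of[b]
        if ca == cb:
            continue  # already in the same pre-cluster
        if len(members[ca]) + len(members[cb]) > max_size:
            continue  # merged pre-cluster would be too large
        # Diameter check: every cross-cluster pair must be within the cap.
        if any(dist[frozenset((x, y))] > max_diameter
               for x in members[ca] for y in members[cb]):
            continue
        members[ca].extend(members[cb])
        for name in members[cb]:
            cluster_of[name] = ca
        del members[cb]

    return sorted(sorted(group) for group in members.values())


if __name__ == "__main__":
    # Toy distances (made up): three closely related genomes and one outlier.
    toy = {
        frozenset(("A", "B")): 0.001, frozenset(("B", "C")): 0.015,
        frozenset(("A", "C")): 0.016, frozenset(("A", "D")): 0.300,
        frozenset(("B", "D")): 0.300, frozenset(("C", "D")): 0.300,
    }
    print(greedy_precluster(["A", "B", "C", "D"], toy))
```

In practice the pairwise distances would come from Mash output rather than a hand-written dictionary. The sketch only shows how the diameter and size caps constrain which merges are accepted.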
Rapid core-genome alignments are then created for each pre-cluster using parsnp [13], which is tailored for intraspecific genome analysis and is therefore well-suited for outbreak analysis. The resulting variant call format (VCF) files for each pre-cluster are converted to NumPy arrays (NPZ files) so that variant data can be loaded and subsetted quickly by PathoSPOT-visualize, the downstream visualization web application that can display called variants alongside phylogenies. The primary output passed to PathoSPOT-visualize is a JSON file containing a matrix of pairwise SNV distances for all genomes (with inter-pre-cluster distances left unspecified) and a maximum-likelihood phylogeny for each pre-cluster. Additional optional pipeline tasks export patient location data (as TSV files) and epidemiological data on positive and negative culture results (as JSON files); when available, both are automatically used by PathoSPOT-visualize and layered onto the comparative genomic analyses.
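To make the structure of the distance matrix concrete, the following Python sketch counts SNV differences between aligned core-genome sequences and leaves distances between genomes from different pre-clusters unspecified. The function names, the input layout (one dictionary of aligned sequences per pre-cluster), the choice to skip positions that are not A, C, G, or T, and the use of `None` (serialized as JSON `null`) for unspecified entries are all assumptions made for illustration; the actual JSON schema used by the pipeline is not described here.

```python
import json
from itertools import combinations

VALID_BASES = set("ACGT")


def snv_distance(seq_a, seq_b):
    """Count positions where two aligned sequences carry different bases.

    Positions with gaps or ambiguous bases are ignored (an assumption).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x != y and x in VALID_BASES and y in VALID_BASES
    )


def distance_matrix(preclusters):
    """Build one matrix covering all genomes.

    preclusters: list of dicts mapping genome name -> aligned core sequence
    (hypothetical layout). Entries between genomes in different
    pre-clusters stay None because they share no core-genome alignment.
    """
    names = [name for cluster in preclusters for name in cluster]
    index = {name: i for i, name in enumerate(names)}
    matrix = [[None] * len(names) for _ in names]
    for cluster in preclusters:
        for name in cluster:
            matrix[index[name]][index[name]] = 0
        for a, b in combinations(cluster, 2):
            d = snv_distance(cluster[a], cluster[b])
            matrix[index[a]][index[b]] = d
            matrix[index[b]][index[a]] = d
    return names, matrix


if __name__ == "__main__":
    # Toy alignments (made up) for two pre-clusters.
    toy = [
        {"g1": "ACGTACGT", "g2": "ACGTACGA", "g3": "ACCTACGA"},
        {"g4": "TTGA", "g5": "TTGC"},
    ]
    names, matrix = distance_matrix(toy)
    print(json.dumps({"genomes": names, "distances": matrix}))
```

Keeping cross-pre-cluster entries empty, rather than filling them with a large placeholder value, avoids implying an SNV count that was never measured.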