To make our study fully reproducible, we released an open-source, version-controlled pipeline called the MetaSUB Core Analysis Pipeline (CAP) (Danko and Mason, 2020). The pipeline covers every step from raw sequence FASTQ files to refined results such as taxonomic and functional profiles. Every tool in the CAP is open source with a permissive license. The CAP is available as a Docker container for easier installation in some instances, and all databases used in the CAP are available for public download. Because the CAP is versioned and its databases are public, researchers can replicate our results and figures.

The MetaSUB dataset and CAP are built and organized for full accessibility to other researchers, consistent with the concept of Open Science. Specifically, we built our study with the FAIR principles in mind: Findable, Accessible, Interoperable, and Reusable. To make our results more reproducible and accessible, we have developed a program, Capalyzer, that merges the CAP’s output into a condensed data packet. This data packet contains results as a series of Tidy-style data tables with descriptions. The advantage of this setup is that result tables for an entire dataset can be parsed with a single command in most high-level analysis languages, such as Python and R. The Capalyzer package also contains Python utilities for parsing and analyzing data packets, which streamline most of the boilerplate tasks of data analysis. All development of the CAP and of the Capalyzer package is open source and permissively licensed.
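
As a rough illustration of why a Tidy-style layout is convenient, the sketch below parses a small table with one row per observation and pivots it into a per-sample view. It uses only the Python standard library and does not use the Capalyzer API; the column names (`sample`, `taxon`, `relative_abundance`) and all values are hypothetical placeholders, not the layout or contents of an actual data packet.

```python
import csv
import io
from collections import defaultdict

# Hypothetical tidy-style table: one row per (sample, taxon) observation.
# Column names and values are illustrative assumptions only.
TIDY_CSV = """sample,taxon,relative_abundance
sample_A,Cutibacterium acnes,0.42
sample_A,Micrococcus luteus,0.10
sample_B,Cutibacterium acnes,0.31
"""


def read_tidy_table(handle):
    """Parse a tidy CSV into a list of row dicts with a float value column."""
    rows = []
    for row in csv.DictReader(handle):
        row["relative_abundance"] = float(row["relative_abundance"])
        rows.append(row)
    return rows


def to_wide(rows):
    """Pivot tidy rows into a nested dict: {sample: {taxon: abundance}}."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["sample"]][row["taxon"]] = row["relative_abundance"]
    return dict(wide)


rows = read_tidy_table(io.StringIO(TIDY_CSV))
wide = to_wide(rows)
```

In an analysis environment with pandas installed, the parsing step would typically reduce to a single `pandas.read_csv` call on the table file, which is the kind of one-command loading that a tidy layout makes possible.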

In addition to general-purpose data analysis tools, essentially all analysis in this paper is available as a series of Jupyter notebooks. These notebooks allow researchers to reproduce our results, build upon them in different contexts, and better understand precisely how we arrived at our conclusions. Because we provide the exact source used to generate our analyses and figures, users can quickly incorporate new data or correct any bugs.

For less technical purposes, we also provide web-based interactive visualizations of our dataset (typically broken into city-specific groups). These visualizations are intended to provide a quick reference for major results as well as an exploratory platform for generating novel hypotheses and serendipitous discovery. The web platform used, MetaGenScope, is open source, permissively licensed, and can be run on a moderately powerful machine (though its output relies on results from the MetaSUB CAP).

Our hope is that, by making our dataset open and easily accessible to other researchers, the scientific community can more rapidly generate and test hypotheses. One of the core goals of the MetaSUB consortium is to build a dataset that benefits public health. As the project develops, we want to make our data easy to use and access for clinicians and public health officials who may not have computational or microbiological expertise. We intend to continue to build tooling that supports these goals.

Since 2017, MetaSUB has partnered with the Critical Assessment of Massive Data Analysis (CAMDA; camda.info), a full conference track at the Intelligent Systems for Molecular Biology (ISMB) Conference. At this venue, a subset of the MetaSUB data was released to the CAMDA community in the form of an annual challenge addressing the problem of geographically locating samples: ‘The MetaSUB Inter-City Challenge’ in 2017 and ‘The MetaSUB Forensics Challenge’ in 2018 and 2019. In the latter challenge, the MetaSUB data was complemented by data from EMP (Thompson et al., 2017) and other studies (Delgado-Baquerizo et al., 2018; Hsu et al., 2016). This Open Science approach of CAMDA has generated multiple interesting results and concepts relating to urban microbiomics, resulting in several publications (https://biologydirect.biomedcentral.com/articles/collections/camdaproc) as well as a perspective manuscript about moving toward metagenomics in the intelligence community (Mason-Buck et al., 2020). The partnership continued in 2020 with ‘The Metagenomic Geolocation Challenge’, in which the MetaSUB data was complemented by climate and weather data in order to construct multi-source microbiome fingerprints and predict the originating ecological niche of each sample.

All data from this study, including data tables that resulted from analyses, may be found at https://pngb.io/metasub-2021. Additionally, raw sequencing reads have been uploaded to the SRA and may be found under the accession SRA ID: PRJNA732392.

