Data were processed using the coproID pipeline v1.0 (Fig. 2) (DOI 10.5281/zenodo.2653757), written in Nextflow (Di Tommaso et al., 2017) and made available through nf-core (Ewels et al., 2019). Nextflow is a domain-specific language designed to ensure reproducibility and scalability of scientific pipelines, and nf-core is a community-developed set of guidelines and tools to promote the standardization and usability of Nextflow pipelines. coproID consists of five steps (Fig. 2).
Figure 2: coproID v1.0 pipeline flowchart. The five steps are Preprocessing (orange), Mapping (blue), Computing host DNA content for each metagenome (red), Metagenomic profiling (green), and Reporting (violet). Individual programs (squared boxes) are colored by category (rounded boxes).
Fastq sequencing files are given as input. After quality control with FastQC (Andrews, 2010), sequencing adapters are removed and ambiguous or low-quality bases (Phred quality score below 20) are trimmed with AdapterRemoval v2 (Schubert, Lindgreen & Orlando, 2016); reads shorter than 30 base pairs are discarded. By default, paired-end reads are merged on overlapping base pairs.
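For orientation, here is a minimal sketch of this cleaning step as a Python subprocess call, assuming AdapterRemoval v2 is on the PATH and paired-end input files named sample_R1/R2.fastq.gz (the file names are illustrative; in coproID this step is wired up as a Nextflow process rather than called from Python):

```python
import subprocess

# Sketch of the read-cleaning parameters described above:
# trim Ns and bases with Phred quality < 20, drop reads < 30 bp,
# and merge overlapping read pairs (--collapse).
subprocess.run(
    ["AdapterRemoval",
     "--file1", "sample_R1.fastq.gz",   # illustrative file names
     "--file2", "sample_R2.fastq.gz",
     "--trimns", "--trimqualities",
     "--minquality", "20",
     "--minlength", "30",
     "--collapse",
     "--basename", "sample"],
    check=True,
)
```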
The preprocessed reads are then aligned to each of the target species genomes (source species) with Bowtie2 (Langmead & Salzberg, 2012) using the --very-sensitive preset while allowing for one mismatch in the seed search (-N 1). When coproID is run in ancient DNA mode (--adna), alignments are filtered with PMDtools (Skoglund et al., 2014) to retain only reads showing post-mortem damage (PMD). PMDtools is run with default settings and the specified library type, and only reads with a PMD score greater than three are kept.
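A comparable sketch of the mapping and PMD-filtering commands, assuming bowtie2 and a pmdtools executable are on the PATH; the index prefix and file names are illustrative assumptions, and library-type options for PMDtools are omitted here:

```python
import subprocess

# Map the merged reads with the --very-sensitive preset, allowing
# one mismatch in the seed search (-N 1), as described above.
subprocess.run(
    ["bowtie2", "-x", "host_genome_index",   # illustrative index prefix
     "-U", "sample.trimmed.fastq.gz",
     "--very-sensitive", "-N", "1",
     "-S", "sample.sam"],
    check=True,
)

# PMDtools reads SAM on stdin and keeps reads whose PMD score exceeds
# the threshold (3, matching coproID's --adna mode as described above).
with open("sample.sam") as sam_in, open("sample.pmd.sam", "w") as sam_out:
    subprocess.run(
        ["pmdtools", "--threshold", "3", "--header"],
        stdin=sam_in, stdout=sam_out, check=True,
    )
```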
Next, the filtered alignments are processed in Python using the Pysam library (Pysam Developers, 2018). Reads matching multiple host genomes above an identity threshold of 0.95 are flagged as common reads ($reads_{common}$), whereas reads mapping above the identity threshold to a single host genome $g$ are flagged as genome-specific host reads ($reads_{spec_g}$). Each source species' host DNA is normalized by genome size and by the host DNA content of its gut microbiome, as follows:

$$NormalizedHostDNA(g) = \frac{\sum length(reads_{spec_g})}{genome_g\ length \times endo_g}$$

where, for each species of genome $g$, $\sum length(reads_{spec_g})$ is the total length of all $reads_{spec_g}$, $genome_g\ length$ is the size of the genome, and $endo_g$ is the host DNA proportion in the species' gut microbiome.
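As a toy recomputation of this normalization with made-up numbers (in coproID the read lengths come from the pysam-parsed alignments):

```python
# sum(length(reads_spec_g)) / (genome_g length * endo_g), as above.
def normalized_host_dna(spec_read_lengths, genome_length, endo):
    return sum(spec_read_lengths) / (genome_length * endo)

# e.g. a ~3.1 Gb human genome, assuming 1% expected host DNA in the
# gut microbiome; read lengths and endo value are made up.
human_norm = normalized_host_dna([75, 80, 62], 3.1e9, 0.01)
```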
Afterwards, a host DNA ratio is computed for each source species as follows:

$$NormalizedHostDNA_{ratio}(source\ species) = \frac{NormalizedHostDNA(source\ species)}{\sum NormalizedHostDNA(source\ species)}$$

where $\sum NormalizedHostDNA(source\ species)$ is the sum of the normalized host DNA over all source species.
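And the corresponding ratio, again with made-up values for a human/dog comparison:

```python
# Host DNA ratio across candidate sources, following the equation
# above; the NormalizedHostDNA values are made up for illustration.
norms = {"Human": 2.1e-6, "Dog": 4.0e-7}
total = sum(norms.values())
ratios = {species: n / total for species, n in norms.items()}
```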
The adapter-clipped and trimmed reads are given as input to Kraken 2 (Wood & Salzberg, 2014). Using the MiniKraken2_v2_8GB database (2019/04/23 version), Kraken 2 performs taxonomic classification and outputs a per-sample taxon count report. All samples' taxon counts are pooled into a taxon count matrix with samples in columns and taxa in rows. Next, Sourcepredict (Borry, 2019b) is used to predict the source of each sample from its taxon composition. Using dimension reduction and K-Nearest Neighbors (KNN) machine learning trained on reference modern gut microbiome samples (Table 1), Sourcepredict estimates a proportion $prop_{microbiome}(source\ species)$ of each potential source species, here Human or Dog, for each sample.
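Sourcepredict's actual implementation combines dimension reduction with KNN and is more involved; as a rough, hypothetical illustration of the KNN idea alone, using scikit-learn and a made-up taxon count matrix:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy reference matrix: rows are samples, columns are taxa; the counts
# and labels are invented and much smaller than real training data.
reference = np.array([
    [120, 3, 40],   # human gut sample
    [115, 5, 38],   # human gut sample
    [10, 90, 2],    # dog gut sample
    [12, 88, 5],    # dog gut sample
])
labels = ["Human", "Human", "Dog", "Dog"]

knn = KNeighborsClassifier(n_neighbors=3).fit(reference, labels)

# Probability-like proportions for an unknown sample, analogous to
# prop_microbiome(source species) in the text.
unknown = np.array([[100, 10, 30]])
print(dict(zip(knn.classes_, knn.predict_proba(unknown)[0])))
```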
For each filtered alignment file, DNA damage patterns are estimated with DamageProfiler (Peltzer & Neukamm, 2019). The host DNA content and the metagenomic profiling information are then combined for each source species in each sample as follows:

$$coproID(source\ species) = NormalizedHostDNA_{ratio}(source\ species) \times prop_{microbiome}(source\ species)$$
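The final combination is a simple product of the two lines of evidence; with made-up inputs from the earlier steps:

```python
# coproID score for one source species, following the equation above;
# the ratio and proportion values here are illustrative only.
def coproid_score(host_dna_ratio, prop_microbiome):
    return host_dna_ratio * prop_microbiome

print(coproid_score(0.85, 0.9))  # e.g. evidence for a human source
```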
Finally, a summary report is generated, including the damage plots, a summary table of the coproID metrics, and the two-dimensional embedding of the samples produced by Sourcepredict. coproID is available on GitHub at github.com/nf-core/coproid.