Data were processed using the coproID pipeline v1.0 (Fig. 2) (DOI 10.5281/zenodo.2653757), written in Nextflow (Di Tommaso et al., 2017) and made available through nf-core (Ewels et al., 2019). Nextflow is a domain-specific language designed to ensure reproducibility and scalability of scientific pipelines, and nf-core is a community-developed set of guidelines and tools to promote the standardization and usability of Nextflow pipelines. coproID consists of five steps (Fig. 2).
Figure 2: coproID v1.0 pipeline flowchart. The five steps are Preprocessing (orange), Mapping (blue), Computing host DNA content for each metagenome (red), Metagenomic profiling (green), and Reporting (violet). Individual programs (squared boxes) are colored by category (rounded boxes).
Fastq sequencing files are given as input. After quality control with FastQC (Andrews, 2010), sequencing adapters are removed and ambiguous or low-quality bases (Phred quality score below 20) are trimmed with AdapterRemoval v2 (Schubert, Lindgreen & Orlando, 2016); reads shorter than 30 base pairs are discarded. By default, paired-end reads are merged on overlapping base pairs.
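For orientation, here is a minimal sketch of this cleaning step as a Python subprocess call, assuming AdapterRemoval v2 is on the PATH and paired-end input files named sample_R1/R2.fastq.gz (the file names are illustrative; in coproID this step is wired up as a Nextflow process rather than called from Python):

```python
import subprocess

# Sketch of the read-cleaning parameters described above:
# trim Ns and bases with Phred quality < 20, drop reads < 30 bp,
# and merge overlapping read pairs (--collapse).
subprocess.run(
    ["AdapterRemoval",
     "--file1", "sample_R1.fastq.gz",   # illustrative file names
     "--file2", "sample_R2.fastq.gz",
     "--trimns", "--trimqualities",
     "--minquality", "20",
     "--minlength", "30",
     "--collapse",
     "--basename", "sample"],
    check=True,
)
```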
The preprocessed reads are then aligned to each of the target species genomes (source species) with Bowtie2 (Langmead & Salzberg, 2012) using the --very-sensitive preset while allowing for one mismatch in the seed search (-N 1). When coproID is run in ancient DNA mode (--adna), alignments are filtered with PMDtools (Skoglund et al., 2014) to retain only reads showing post-mortem damage (PMD). PMDtools is run with default settings and the specified library type, and only reads with a PMD score greater than three are kept.
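A comparable sketch of the mapping and PMD-filtering commands, assuming bowtie2 and a pmdtools executable are on the PATH; the index prefix and file names are illustrative assumptions, and library-type options for PMDtools are omitted here:

```python
import subprocess

# Map the merged reads with the --very-sensitive preset, allowing
# one mismatch in the seed search (-N 1), as described above.
subprocess.run(
    ["bowtie2", "-x", "host_genome_index",   # illustrative index prefix
     "-U", "sample.trimmed.fastq.gz",
     "--very-sensitive", "-N", "1",
     "-S", "sample.sam"],
    check=True,
)

# PMDtools reads SAM on stdin and keeps reads whose PMD score exceeds
# the threshold (3, matching coproID's --adna mode as described above).
with open("sample.sam") as sam_in, open("sample.pmd.sam", "w") as sam_out:
    subprocess.run(
        ["pmdtools", "--threshold", "3", "--header"],
        stdin=sam_in, stdout=sam_out, check=True,
    )
```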
Next, the filtered alignments are processed in Python using the Pysam library (Pysam Developers, 2018). Reads matching multiple host genomes above an identity threshold of 0.95 are flagged as common reads ($reads_{common}$), whereas reads mapping above the identity threshold to a single host genome $g$ are flagged as genome-specific host reads ($reads_{spec_g}$). Each source species' host DNA is normalized by genome size and by the host DNA content of its gut microbiome, as follows:

$$NormalizedHostDNA(g) = \frac{\sum length(reads_{spec_g})}{genome_g\ length \times endo_g}$$

where, for each species of genome $g$, $\sum length(reads_{spec_g})$ is the total length of all $reads_{spec_g}$, $genome_g\ length$ is the size of the genome, and $endo_g$ is the host DNA proportion in the species' gut microbiome.
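As a toy recomputation of this normalization with made-up numbers (in coproID the read lengths come from the pysam-parsed alignments):

```python
# sum(length(reads_spec_g)) / (genome_g length * endo_g), as above.
def normalized_host_dna(spec_read_lengths, genome_length, endo):
    return sum(spec_read_lengths) / (genome_length * endo)

# e.g. a ~3.1 Gb human genome, assuming 1% expected host DNA in the
# gut microbiome; read lengths and endo value are made up.
human_norm = normalized_host_dna([75, 80, 62], 3.1e9, 0.01)
```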
Afterwards, a host DNA ratio is computed for each source species as follows:

$$NormalizedHostDNA_{ratio}(source\ species) = \frac{NormalizedHostDNA(source\ species)}{\sum NormalizedHostDNA(source\ species)}$$

where $\sum NormalizedHostDNA(source\ species)$ is the sum of the normalized host DNA over all source species.
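And the corresponding ratio, again with made-up values for a human/dog comparison:

```python
# Host DNA ratio across candidate sources, following the equation
# above; the NormalizedHostDNA values are made up for illustration.
norms = {"Human": 2.1e-6, "Dog": 4.0e-7}
total = sum(norms.values())
ratios = {species: n / total for species, n in norms.items()}
```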
The adapter-clipped and trimmed reads are given as input to Kraken 2 (Wood & Salzberg, 2014). Using the MiniKraken2_v2_8GB database (2019/04/23 version), Kraken 2 performs taxonomic classification and outputs a per-sample taxon count report. All samples' taxon counts are pooled into a taxon count matrix with samples in columns and taxa in rows. Next, Sourcepredict (Borry, 2019b) is used to predict the source of each sample from its taxon composition. Using dimension reduction and K-Nearest Neighbors (KNN) machine learning trained on reference modern gut microbiome samples (Table 1), Sourcepredict estimates a proportion $prop_{microbiome}(source\ species)$ of each potential source species, here Human or Dog, for each sample.
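Sourcepredict's actual implementation combines dimension reduction with KNN and is more involved; as a rough, hypothetical illustration of the KNN idea alone, using scikit-learn and a made-up taxon count matrix:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy reference matrix: rows are samples, columns are taxa; the counts
# and labels are invented and much smaller than real training data.
reference = np.array([
    [120, 3, 40],   # human gut sample
    [115, 5, 38],   # human gut sample
    [10, 90, 2],    # dog gut sample
    [12, 88, 5],    # dog gut sample
])
labels = ["Human", "Human", "Dog", "Dog"]

knn = KNeighborsClassifier(n_neighbors=3).fit(reference, labels)

# Probability-like proportions for an unknown sample, analogous to
# prop_microbiome(source species) in the text.
unknown = np.array([[100, 10, 30]])
print(dict(zip(knn.classes_, knn.predict_proba(unknown)[0])))
```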
For each filtered alignment file, DNA damage patterns are estimated with DamageProfiler (Peltzer & Neukamm, 2019). The host DNA content and the metagenomic profiling information are then combined for each source species in each sample as follows:

$$coproID(source\ species) = NormalizedHostDNA_{ratio}(source\ species) \times prop_{microbiome}(source\ species)$$
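The final combination is a simple product of the two lines of evidence; with made-up inputs from the earlier steps:

```python
# coproID score for one source species, following the equation above;
# the ratio and proportion values here are illustrative only.
def coproid_score(host_dna_ratio, prop_microbiome):
    return host_dna_ratio * prop_microbiome

print(coproid_score(0.85, 0.9))  # e.g. evidence for a human source
```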
Finally, a summary report is generated, including the damage plots, a summary table of the coproID metrics, and the two-dimensional embedding of the samples produced by Sourcepredict. coproID is available on GitHub at github.com/nf-core/coproid.