The ChocoPhlAn 3 pipeline

Francesco Beghini; Lauren J McIver; Aitor Blanco-Míguez; Leonard Dubois; Francesco Asnicar; Sagun Maharjan; Ana Mailyan; Paolo Manghi; Matthias Scholz; Andrew Maltez Thomas; Mireia Valles-Colomer; George Weingart; Yancong Zhang; Moreno Zolfo; Curtis Huttenhower; Eric A Franzosa; Nicola Segata

This method is extracted from research article: eLife, May 2021

Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3

DOI: 10.7554/eLife.65088

We developed the ChocoPhlAn pipeline to organize microbial reference genomes according to their taxonomy and to compute the relevant sequence and annotation data for subsequent bioBakery modules. At a high level, after retrieval of UniProt genomes and gene annotations, species-specific pangenomes (i.e. the set of gene families of a species present in at least one of its genomes) are generated using all the microbial reference genomes passing initial quality control. Core genomes (i.e. gene families present in all the genomes of a species) are then identified from the whole set of pangenomes and used as markers in PhyloPhlAn 3. Core genomes are also processed for the extraction of unique marker genes (i.e. core gene families uniquely associated with one species) that constitute the marker database for MetaPhlAn 3 and StrainPhlAn 3. Finally, functionally annotated pangenomes are processed to serve as references for PanPhlAn 3 and HUMAnN 3.

ChocoPhlAn relies on the UniProt core data resources (The UniProt Consortium, 2019) (release January 2019) and on the NCBI taxonomy and genomes repositories (NCBI Resource Coordinators, 2014) (release January 2019). The two basic sequence data types considered in ChocoPhlAn are the raw genomes of all available microbes and all the microbial proteins/genes identified on these genomes. Genomes are organized according to the underlying microbial taxonomy, whereas the microbial proteins are organized in protein families clustered at multiple stringency levels.

We adopted the NCBI taxonomy database (NCBI Resource Coordinators, 2014) for ChocoPhlAn because it is also the taxonomy on which our genomic repository, UniProt, is based. The full taxonomy was downloaded from the NCBI FTP server (ftp.ncbi.nlm.nih.gov/pub/taxonomy/) on January 24, 2019. We tagged as low-quality any species whose name contains ‘unidentified’, ‘sp.’, ‘Candidatus’, ‘bacterium’, or one of several other keywords. Specifically, the regular expressions used to filter low-quality taxonomic annotations are:

“(C|c)andidat(e|us) | _sp(_.*|$) | (.*_|^)(b|B)acterium(_.*|) | .*(eury|)archaeo(n_|te|n$).* | .*(endo|)symbiont.* | .*genomosp_.* | .*unidentified.* | .*_bacteria_.* | .*_taxon_.* | .*_et_al_.* | .*_and_.* | .*(cyano|proteo|actinobacterium_.*)”
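As an illustration only, the filter can be sketched in Python. The sketch assumes that the spaces around each `|` in the printed expression are typographic, that the pattern is applied as a search rather than a full match, and that species names are already normalized with underscores in place of spaces and without periods. The names `LOW_QUALITY` and `is_low_quality` are hypothetical and are not taken from the ChocoPhlAn code.

```python
import re

# Alternatives transcribed from the expression in the text. Dropping the
# spaces around "|" is an assumption made for this sketch.
LOW_QUALITY = re.compile(
    r"(C|c)andidat(e|us)"
    r"|_sp(_.*|$)"
    r"|(.*_|^)(b|B)acterium(_.*|)"
    r"|.*(eury|)archaeo(n_|te|n$).*"
    r"|.*(endo|)symbiont.*"
    r"|.*genomosp_.*"
    r"|.*unidentified.*"
    r"|.*_bacteria_.*"
    r"|.*_taxon_.*"
    r"|.*_et_al_.*"
    r"|.*_and_.*"
    r"|.*(cyano|proteo|actinobacterium_.*)"
)


def is_low_quality(species_name):
    """Return True if the underscore-normalized name hits any keyword pattern."""
    return LOW_QUALITY.search(species_name) is not None


# Toy names chosen to exercise the anchoring of "_sp" and "bacterium".
for name in ["Escherichia_coli", "Bacillus_sp_X", "Clostridium_sporogenes"]:
    print(name, is_low_quality(name))
```

Note that `_sp` only matches when it is followed by an underscore or the end of the name, so a species epithet that merely begins with "sp" is not tagged.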

All reference genomes available through UniProt Proteomes and linked to the public DDBJ, ENA, and GenBank repositories were then considered. UniProt includes a genome in UniProt Proteomes only if it is fully annotated and its number of predicted CDSs falls within a statistically defined range of published proteomes from neighbouring species (What are proteomes, 2020). We considered all UniProt Proteomes genomes assigned to the archaeal and bacterial domains. For micro-eukaryotes, we considered all genomes assigned to the following manually selected genera: Blastocystis, Candida, Saccharomyces, Cryptosporidium, Entamoeba, Aspergillus, Cryptococcus, Cyclospora, Cystoisospora, Giardia, Leishmania, Malassezia, Neosartorya, Pneumocystis, Toxoplasma, Trachipleistophora, Trichinella, Trichomonas, and Trypanosoma.

Reference genomes (FASTA format, suffix ‘.fna’) and the associated genomic annotation (GFF format, suffix ‘.gff’) of each proteome were downloaded from the NCBI GenBank FTP server (ftp.ncbi.nlm.nih.gov/genomes/all/GCA) by retrieving URLs from the assembly_summary_genbank.txt file (ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt) using the GCA accession included in the UniProt Proteomes resource (January 24, 2019). Starting from a total of 111,825 UniProt Proteomes entries, we discarded 12,598 proteomes lacking a GenBank accession, ending up with 99,227 genomes (997 Archaea, 97,941 Bacteria, 339 Eukaryota).
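The URL lookup described above can be sketched as follows. This is a minimal illustration, not the pipeline's code: it assumes the summary file is tab-separated with a `#`-prefixed header row naming the `assembly_accession` and `ftp_path` columns, and that the genome and annotation files sit under `ftp_path` as `<directory name>_genomic.fna.gz` and `<directory name>_genomic.gff.gz`. The function name `genome_urls` is hypothetical.

```python
def genome_urls(summary_text, wanted_accessions):
    """Map each wanted GCA accession to its (.fna.gz, .gff.gz) URLs.

    summary_text: contents of an assembly summary file (assumed layout).
    wanted_accessions: set of GCA accessions taken from UniProt Proteomes.
    """
    header = None
    urls = {}
    for line in summary_text.splitlines():
        if line.startswith("#"):
            # The column-name row is the comment line that mentions ftp_path.
            fields = line.lstrip("# ").split("\t")
            if "ftp_path" in fields:
                header = fields
            continue
        if header is None or not line.strip():
            continue
        row = dict(zip(header, line.split("\t")))
        accession = row["assembly_accession"]
        ftp_path = row["ftp_path"]
        # Entries without a usable path are skipped.
        if accession in wanted_accessions and ftp_path not in ("", "na"):
            base = ftp_path.rstrip("/").rsplit("/", 1)[-1]
            urls[accession] = (
                f"{ftp_path}/{base}_genomic.fna.gz",
                f"{ftp_path}/{base}_genomic.gff.gz",
            )
    return urls
```

Proteomes whose accession is absent from the returned mapping correspond to entries that cannot be linked to a downloadable GenBank assembly.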

The microbial proteins (and genes) associated with at least one UniProt Proteome and considered by ChocoPhlAn are retrieved from the UniProt Knowledgebase (UniProtKB) and the UniProt Archive (UniParc) databases. Proteins included in UniProtKB have been derived from the translation of the CDSs of all available reference genomes included in UniProt Proteomes. ChocoPhlAn 3 also retrieves and includes relevant data present in the UniProtKB entries (retrieved from ftp.uniprot.org/pub/databases/uniprot/ as XML files uniprot_sprot.xml.gz, uniprot_trembl.xml.gz, uniparc_all.xml.gz), such as functional, phylogenomic, and protein domain annotations (KEGG, KO, EggNOG, GO, EC, Pfam) (El-Gebali et al., 2019; Huerta-Cepas et al., 2016; Kanehisa and Goto, 2000; The Gene Ontology Consortium, 2019), accessions for cross-referencing entries with external databases (GenBank, ENA, and BioCyc) (Clark et al., 2016; Karp et al., 2019; Leinonen et al., 2011), the name of the gene that encodes the protein, and the proteome accession.
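To illustrate how such annotations might be pulled from the XML files, the sketch below streams entries with the Python standard library. It assumes the UniProtKB XML layout in which each `<entry>` holds `<accession>` and `<dbReference type="..." id="..."/>` elements in the `http://uniprot.org/uniprot` namespace. The function name, the default list of database types, and the toy record are illustrative choices, not part of ChocoPhlAn.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

NS = "{http://uniprot.org/uniprot}"  # assumed UniProtKB XML namespace


def cross_references(xml_stream,
                     wanted=("KEGG", "KO", "eggNOG", "GO", "EC", "Pfam")):
    """Yield (first accession, {database type: [ids]}) for each <entry>."""
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != NS + "entry":
            continue
        accession = elem.findtext(NS + "accession")
        refs = defaultdict(list)
        for ref in elem.iter(NS + "dbReference"):
            if ref.get("type") in wanted:
                refs[ref.get("type")].append(ref.get("id"))
        yield accession, dict(refs)
        elem.clear()  # release the parsed entry to keep memory use flat


# Toy record standing in for one entry of a real XML dump.
toy_xml = (
    b'<uniprot xmlns="http://uniprot.org/uniprot"><entry>'
    b"<accession>P12345</accession>"
    b'<dbReference type="Pfam" id="PF00001"/>'
    b'<dbReference type="GO" id="GO:0005524"/>'
    b'<dbReference type="PubMed" id="1"/>'
    b"</entry></uniprot>"
)
toy_refs = dict(cross_references(io.BytesIO(toy_xml)))
```

For the compressed files named in the text, the stream could be opened with `gzip.open(path, "rb")` and passed to the same function; streaming matters here because the full dumps are far too large to load as a single tree.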

We processed a total of 203.9M proteins included in both UniProtKB and UniParc, and 126.9M of them were associated with a UniProt Proteome entry. The Bacteria domain tallied the highest number of proteins (194.8M), whereas Archaea and Eukaryotes accounted for 5.0M and 4.0M proteins, respectively.

In order to reduce the redundancy of the database, we use the UniRef90 clustering of UniProtKB proteins provided by UniProt. In brief, UniProtKB proteins are clustered at different thresholds of sequence identity (100%, 90%, and 50%) and made available through the UniProt Reference Clusters (UniRef) resource (Suzek et al., 2015). UniRef90 clusters are generated by clustering unique sequences (UniRef100, which combines identical UniProtKB proteins in a single cluster); this was done with CD-HIT (Li and Godzik, 2006) until August 2019 and with MMseqs2 (Steinegger and Söding, 2018) afterward. Sequences in UniRef90 clusters have at least 90% sequence identity (Suzek et al., 2015). UniRef50 clusters are generated by clustering the UniRef90 cluster seed sequences, and each cluster contains proteins with at least 50% identity. Both UniRef90 and UniRef50 require each protein to overlap at least 80% with the cluster's longest sequence. Each UniRef entry considered in ChocoPhlAn 3 contains the sequence of a representative protein, the accession IDs of all the entries included in the cluster, the accessions of the UniProtKB and UniParc records, and the accessions of the other associated UniRef clusters.
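The membership criteria stated above can be written as a small check. This is only a restatement of the thresholds in the text, not how CD-HIT or MMseqs2 perform clustering; the function name and the way overlap is computed (aligned length divided by the length of the cluster's longest sequence) are assumptions for illustration.

```python
# Identity thresholds as stated in the text.
UNIREF_MIN_IDENTITY = {"UniRef100": 1.0, "UniRef90": 0.9, "UniRef50": 0.5}
# Overlap requirement stated for UniRef90 and UniRef50 only.
MIN_OVERLAP = 0.8


def meets_uniref_criteria(level, identity, aligned_length, longest_length):
    """Check the identity and overlap thresholds for one candidate member."""
    if identity < UNIREF_MIN_IDENTITY[level]:
        return False
    if level == "UniRef100":
        return True  # identical sequences; no overlap rule is stated
    return aligned_length / longest_length >= MIN_OVERLAP
```

For example, a protein with 95% identity that aligns over only half of the longest sequence would satisfy the identity threshold for UniRef90 but fail the overlap requirement.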

A total of 292.1M UniRef clusters were processed (172.3M, 87.3M, and 32.5M for UniRef100, UniRef90, and UniRef50, respectively) and associated with each protein and each genome in ChocoPhlAn 3.

We then generate pan-proteomes for each species represented by at least one UniProt Proteome. We define a species’ pan-proteome as the non-redundant representation of the species’ protein-coding potential. Each pan-proteome is obtained by considering the unique UniRef90 and UniRef50 protein families present in the genomes assigned at the species level and below.

For each pan-protein, we compute several scores. We define the ‘coreness’ score of a UniRef90 family as the number of genomes included in the species’ pan-proteome that have a protein belonging to that UniRef family, and the ‘uniqueness’ score as the number of pan-proteomes of other species possessing the same pan-protein. We also compute a ‘uniqueness_sp’ score, a variant of the ‘uniqueness’ score that excludes the species previously tagged as low-quality. Alongside the ‘uniqueness’ score, we compute the ‘external_genomes’ score as the number of genomes (rather than species or species’ pan-proteomes) of other species’ pan-proteomes possessing the same pan-protein. These scores are computed for both UniRef50 and UniRef90 protein families.
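The pan-proteome construction and the four scores can be sketched on toy data as follows. The data layout (a mapping from species to genomes to sets of UniRef family identifiers) and all names are hypothetical. The extra `coreness_fraction` field, coreness divided by the number of genomes of the species, is an assumption added so that the score can be compared against a fractional cut-off; it is not defined in the text.

```python
def pan_protein_scores(genomes_by_species, low_quality_species=frozenset()):
    """Score every (species, family) pair.

    genomes_by_species: {species: {genome_id: set of UniRef family ids}}
    low_quality_species: species excluded when computing uniqueness_sp
    """
    # Pan-proteome: the union of families over all genomes of a species.
    pan = {
        sp: set().union(*genomes.values())
        for sp, genomes in genomes_by_species.items()
    }
    scores = {}
    for sp, genomes in genomes_by_species.items():
        for fam in pan[sp]:
            # Genomes of this species that carry the family.
            coreness = sum(fam in fams for fams in genomes.values())
            # Other species whose pan-proteome contains the family.
            others = [o for o in pan if o != sp and fam in pan[o]]
            scores[(sp, fam)] = {
                "coreness": coreness,
                "coreness_fraction": coreness / len(genomes),  # assumption
                "uniqueness": len(others),
                "uniqueness_sp": sum(
                    o not in low_quality_species for o in others
                ),
                # Genomes (not species) of other species carrying the family.
                "external_genomes": sum(
                    fam in fams
                    for o in others
                    for fams in genomes_by_species[o].values()
                ),
            }
    return scores


toy = {
    "Species_A": {"g1": {"F1", "F2"}, "g2": {"F1", "F3"}},
    "Species_B": {"g3": {"F1"}, "g4": {"F1", "F4"}},
    "Species_C_sp": {"g5": {"F2"}},
}
toy_scores = pan_protein_scores(toy, low_quality_species={"Species_C_sp"})
```

In this toy set, family F2 of Species_A is shared only with a low-quality species, so its ‘uniqueness’ is 1 while its ‘uniqueness_sp’ is 0, which is the distinction the two scores are meant to capture.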

In ChocoPhlAn 3 we consider a total of 22,096 species’ pan-proteomes and a total of 87.3M UniRef90 core proteins (i.e. with coreness >0.7; mean 3,952, s.d. 6,311 per species).

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.
