MetaPhlAn relies on a set of unique and species-specific nucleotide markers that were updated in MetaPhlAn 3 starting from the ChocoPhlAn 3 pan-proteomes. We initially filtered out species having taxonomies previously tagged as low quality using the species-level genome bin (SGB) system (Pasolli et al., 2019). ‘Low-quality’ species that were assigned to the same SGB were merged and only the representative SGB was taken into account.
In total, 1328 species (6%) were merged in this way, as they were unlikely to be distinguishable in metagenomic samples and would potentially lead to false-positive taxonomic assignments (see Supplementary file 7 for the merged species). Where multiple species grouped by the NCBI taxonomy into a ‘species-group’ showed a high number of markers with a high ‘uniqueness’ score (>30), we instead identified unique markers for the whole species group. This occurred for the following species groups: Streptococcus anginosus group, Lactobacillus casei group, Bacillus subtilis group, Enterobacter cloacae complex, Pseudomonas syringae group, Pseudomonas stutzeri group, Pseudomonas putida group, Pseudomonas fluorescens group, Pseudomonas aeruginosa group, Streptococcus dysgalactiae group, and Bacillus cereus group. In all these cases, the pangenomes were built by merging all the species-level pangenomes and treating them as a single species.
In the first step of the marker discovery procedure, we used the pan-proteome built from the UniRef90 clusters, considering all proteins with a length between 150 and 1500 amino acids. Starting from the coreness and uniqueness scores, we applied an iterative approach to find up to 150 unique markers per species whenever possible, retaining only those species with a minimum of 10 unique markers. We classified candidate markers into unique markers and quasi-markers according to the ‘uniqueness’ value: markers with zero ‘uniqueness’ are reported as ‘unique markers’. When no unique markers can be identified, the less stringent thresholds used in the marker discovery procedure allow the identification of so-called ‘quasi-markers’, that is, markers with non-zero ‘uniqueness’ values.
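The length filter and the unique/quasi distinction described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and treating the 150 and 1500 amino acid bounds as inclusive is an assumption, since the text does not state how the boundaries are handled.

```python
# Length window for candidate proteins, in amino acids (values from the text).
MIN_LEN, MAX_LEN = 150, 1500


def is_eligible(protein_length: int) -> bool:
    """Return True if the protein length falls in the window.

    Assumption: both bounds are inclusive.
    """
    return MIN_LEN <= protein_length <= MAX_LEN


def marker_kind(uniqueness: int) -> str:
    """Label a candidate as 'unique' (zero uniqueness) or 'quasi' (non-zero)."""
    return "unique" if uniqueness == 0 else "quasi"
```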
The iterative approach started with the definition of four tiers of unique markers according to a combination of the values of ‘coreness’, ‘uniqueness’, and ‘external_genomes’. Tier ‘A’ includes pan-proteins with a coreness score higher than 80%, not shared with more than two other pan-proteomes considering both the UniRef90 and UniRef50 clustering scores (‘Uniqueness_NR90’ and ‘Uniqueness_NR50’), and not present in more than 10 single genomes when considering UniRef90 and 5 single genomes when considering UniRef50 (‘External_genomes_NR90’ and ‘External_genomes_NR50’, respectively). Tier ‘B’ includes markers with ‘coreness’ values between 70% and 80%, ‘Uniqueness_NR90’ and ‘Uniqueness_NR50’ values of 5, and values of ‘External_genomes_NR90’ and ‘External_genomes_NR50’ lower than 15 and 10 genomes, respectively. Markers that did not meet the previous criteria were included in the ‘C’ tier, which includes markers with ‘coreness’ values between 50% and 70%, ‘Uniqueness_NR90’ less than 10, ‘Uniqueness_NR50’ less than 15, ‘External_genomes_NR90’ less than 25, and ‘External_genomes_NR50’ less than 20. Markers for species having only one genome included in the pan-proteome, for which the definition of coreness is trivial, were classified as tier ‘U’, provided that they have zero ‘Uniqueness’.
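The tier thresholds can be written down as a small Python sketch. This is a literal reading of the text, not the MetaPhlAn implementation: the `Candidate` class and `assign_tier` function are hypothetical names, the tier ‘B’ uniqueness values "of 5" are assumed to mean "at most 5", and the handling of coreness values exactly at 70% and 80% is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    coreness: float   # percent of the species' genomes carrying the protein
    uniq_nr90: int    # 'Uniqueness_NR90'
    uniq_nr50: int    # 'Uniqueness_NR50'
    ext_nr90: int     # 'External_genomes_NR90'
    ext_nr50: int     # 'External_genomes_NR50'


def assign_tier(c: Candidate, n_genomes: int) -> Optional[str]:
    """Return 'A', 'B', 'C', 'U', or None if no tier's thresholds are met."""
    if n_genomes == 1:
        # Coreness is trivial with one genome; only zero uniqueness qualifies.
        return "U" if c.uniq_nr90 == 0 and c.uniq_nr50 == 0 else None
    if (c.coreness > 80 and c.uniq_nr90 <= 2 and c.uniq_nr50 <= 2
            and c.ext_nr90 <= 10 and c.ext_nr50 <= 5):
        return "A"
    # Assumption: "values of 5" is read as "at most 5".
    if (70 <= c.coreness <= 80 and c.uniq_nr90 <= 5 and c.uniq_nr50 <= 5
            and c.ext_nr90 < 15 and c.ext_nr50 < 10):
        return "B"
    if (50 <= c.coreness < 70 and c.uniq_nr90 < 10 and c.uniq_nr50 < 15
            and c.ext_nr90 < 25 and c.ext_nr50 < 20):
        return "C"
    return None
```

Under this literal reading, a highly conserved protein that fails the tier ‘A’ sharing limits is not picked up by tier ‘B’ or ‘C’, because their coreness ranges exclude it; an implementation with cumulative thresholds would behave differently.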
The definition of specific tiers allows the retrieval of the maximum number of unique markers. The marker discovery procedure was performed iteratively for each tier. Candidate markers that met the tier-defined thresholds were ranked using a score function defined as follows:
Where
The score function as defined prioritizes the selection of candidate markers that are highly conserved in the clade (high ‘coreness’ value) but shared with the smallest possible number of other species (low values of ‘uniqueness’). A tier type was assigned to each candidate marker, and if more than 50 candidate markers were identified, we selected up to 150 markers from the ranked list. If not enough markers were identified (fewer than 50), the procedure was repeated using the subsequent tier’s thresholds. If no markers were identified using tier C thresholds, the species was discarded.
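The tier-by-tier fallback can be sketched in Python as follows. Several parts are assumptions rather than the published method: `rank_key` is only a stand-in ordering (high coreness first, then low uniqueness) and is not the score function of the paper; exactly 50 candidates is treated as sufficient; and when no tier reaches 50 candidates, the tier C pool is kept only if it contains at least 10 markers, following the minimum stated earlier. The function and dictionary key names are hypothetical.

```python
TIER_ORDER = ("A", "B", "C")
ENOUGH = 50        # candidates needed to stop at a tier
MAX_MARKERS = 150  # upper bound on markers kept per species
MIN_MARKERS = 10   # assumed floor when falling through to the last tier


def rank_key(candidate):
    # Stand-in ranking, NOT the paper's score function:
    # higher coreness first, ties broken by lower uniqueness.
    return (-candidate["coreness"], candidate["uniqueness"])


def select_markers(candidates_by_tier):
    """Return (tier, ranked markers) for a species, or None if it is discarded.

    candidates_by_tier maps 'A'/'B'/'C' to the candidates that meet
    that tier's thresholds.
    """
    pool = []
    for tier in TIER_ORDER:
        pool = sorted(candidates_by_tier.get(tier, []), key=rank_key)
        if len(pool) >= ENOUGH:
            return tier, pool[:MAX_MARKERS]
    # No tier had enough candidates; 'pool' now holds the tier C candidates.
    if len(pool) >= MIN_MARKERS:
        return TIER_ORDER[-1], pool[:MAX_MARKERS]
    return None
```

Passing the per-tier candidate lists in as input keeps the sketch neutral on whether tiers are disjoint or cumulative, which the text leaves open.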
Nucleotide sequences for each marker selected with this procedure are then considered as entries for the MetaPhlAn database. To refine the number of species estimated by the ‘uniqueness’ parameter, marker sequences were split into non-overlapping chunks of 150 bp and mapped with bowtie2 (version 2.3.4.3, parameters ‘-a --very-sensitive --no-unal --no-hd --no-sq’) against an index built from all the reference genomes used in the marker identification process. We counted a newly identified species toward the ‘uniqueness’ parameter if at least 150 consecutive nucleotides of the marker sequence were found in the corresponding reference genome.
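Splitting a marker into non-overlapping 150 bp chunks can be sketched in Python as below. The function name is hypothetical, and dropping a trailing fragment shorter than 150 bp by default is an assumption: the text does not say how remainders were handled, although a shorter fragment could not satisfy the 150-consecutive-nucleotide criterion on its own.

```python
def split_into_chunks(seq: str, size: int = 150, keep_remainder: bool = False):
    """Split a sequence into consecutive, non-overlapping chunks of `size` bp.

    Assumption: a trailing fragment shorter than `size` is dropped unless
    keep_remainder is True.
    """
    # Start positions 0, size, 2*size, ... for which a full chunk still fits.
    chunks = [seq[i:i + size] for i in range(0, len(seq) - size + 1, size)]
    remainder = len(seq) % size
    if keep_remainder and remainder:
        chunks.append(seq[-remainder:])
    return chunks
```

For a 400 bp marker this yields two full chunks, plus a 100 bp fragment if `keep_remainder=True`.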
We performed an additional curation step for the markers of species whose genomes were obtained from Co-Abundance gene Groups (CAGs) (MetaHIT Consortium et al., 2014). To reduce the number of false positives, we removed a CAG species if more than 50% of its markers were shared with the species that gave the taxonomy to the CAG genome.
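The CAG curation rule amounts to a simple fraction test, sketched here in Python. The function name and the input layout (a mapping from marker identifier to the set of species sharing that marker) are hypothetical, and discarding a CAG species that has no markers at all is an assumption.

```python
def keep_cag_species(marker_shared_with, donor_species, max_shared_fraction=0.5):
    """Decide whether a CAG species is kept.

    marker_shared_with: dict mapping marker id -> set of species sharing it.
    donor_species: the species that gave its taxonomy to the CAG genome.
    The species is removed when more than max_shared_fraction of its
    markers are shared with the donor species.
    """
    if not marker_shared_with:
        return False  # assumption: a species without markers is not kept
    shared = sum(1 for species in marker_shared_with.values()
                 if donor_species in species)
    return shared / len(marker_shared_with) <= max_shared_fraction
```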
Each marker has an associated entry in the MetaPhlAn database, which includes the species for which the sequence is a marker, the list of species sharing the marker, the sequence length, and the taxonomy of the species. Viral markers were taken from the v20_m200 MetaPhlAn 2 database.
Altogether, this procedure identified a total of 1.1M markers for 13,475 species (Supplementary file 8).