Artificial genomes were constructed to simulate different scenarios of genome contamination and reference representation (see Fig. 2a). All simulations were performed using genomes in the curated and taxonomically annotated proGenomes 2.1 database [34], serving as a baseline for clean, in-reference genomes (“type 1” in Fig. 2a). Further simulation scenarios are described below. Unless otherwise indicated, simulations were conducted separately for each taxonomic level and at contamination portions of 5%, 10%, 15%, 20%, 30%, 40%, and 50%, with 3000 simulated genomes (iterations) for each combination of taxonomic level and contamination portion. In each simulated genome, source genome contigs were randomly fragmented such that contig size was inversely proportional to contig frequency, parameterized based on the empirical frequency-size distributions of MAGs in the Pasolli, Almeida, and Nayfach datasets [13–15]. Simulated genomes were then generated from these simulated contigs based on the rules set out below:
Type 1: Clean (non-contaminated) genomes, in reference. Taken from proGenomes 2.1.
Type 2: Clean (non-contaminated) genomes, out of reference. Simulated by removing a genome’s entire source lineage from the reference.
Type 3a: Binary chimeric genome from two sources, both in reference. Simulated by randomly selecting “donor” and “acceptor” genomes whose lineages diverged at any of the seven tested taxonomic levels (divergence levels). Either a fraction of the acceptor genome was replaced by a matching fraction of the donor genome (to simulate non-redundant contamination), or the corresponding fraction of the donor genome was added to the complete acceptor genome (to simulate redundant contamination).
Type 3b: Chimera of multiple (3, 4, or 5) source genomes, all in reference. Source genomes from different source clades were mixed at equal shares totaling 1 altogether, e.g., 1/3, 1/4, or 1/5 each.
Type 4: Binary chimera, both source lineages out of reference at subordinate levels. Source lineage clades removed at subordinate levels (e.g., genus or family) but sister clades retained in reference within the same parent clades (e.g., class or phylum), so that both higher-level source clades were represented at divergence level. Simulated 10,000 times for each taxonomic and contamination level.
Type 5a: Binary chimera, one source lineage in reference, one out of reference at divergence level. Acceptor genome (in reference) partially replaced by donor genome (out of reference at divergence level).
Type 5b: Binary chimera, both source lineages out of reference at divergence level, i.e., no genome available from the entire clades (at divergence level) containing the source genomes.
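The two mixing modes used for binary chimeras (types 3a, 4, and 5) can be illustrated with a minimal sketch. This is not the code used in the study: the function names `sample_contigs` and `binary_chimera` are hypothetical, genomes are represented simply as lists of contig sequences, and the contamination portion is assumed here to be measured in base pairs relative to the acceptor genome length. Because whole contigs are sampled, the realized portion only approximates the target.

```python
import random


def sample_contigs(contigs, target_bp, rng):
    """Randomly pick whole contigs until their summed length reaches target_bp."""
    pool = list(contigs)
    rng.shuffle(pool)
    picked, total = [], 0
    for contig in pool:
        if total >= target_bp:
            break
        picked.append(contig)
        total += len(contig)
    return picked


def binary_chimera(acceptor, donor, portion, redundant, seed=0):
    """Mix donor contigs into an acceptor genome (hypothetical helper).

    Assumption: `portion` is a fraction of the acceptor genome length in bp.
    redundant=False: a share of the acceptor is replaced by donor contigs.
    redundant=True:  donor contigs are added to the complete acceptor genome.
    """
    rng = random.Random(seed)
    acceptor_bp = sum(len(c) for c in acceptor)
    donor_bp = round(portion * acceptor_bp)
    donor_part = sample_contigs(donor, donor_bp, rng)
    if redundant:
        return list(acceptor) + donor_part
    kept = sample_contigs(acceptor, acceptor_bp - donor_bp, rng)
    return kept + donor_part
```

With ten 100 bp contigs per source genome and a contamination portion of 0.2, the non-redundant mode in this sketch yields eight acceptor and two donor contigs, whereas the redundant mode keeps all ten acceptor contigs and adds two donor contigs.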
To check for potential performance bias due to the selected reference set and taxonomy, an additional round of simulations was performed in which genomes from GTDB v95 [2] were used instead of proGenomes 2.1. For this purpose, an alternative GUNC reference set based on GTDB species-representative genomes was generated. Apart from these differences, every aspect of this additional simulation was equivalent to the original simulation. We also confirmed that optimal GUNC CSS cutoff values with a GTDB reference did not differ significantly from those originally defined with a proGenomes-based reference.