GUNC computes several scores to quantify a query genome’s quality, its representation in the GUNC reference database and its levels of putative contamination. The GUNC clade separation score (CSS) is an entropy-based clustering measure that assesses how homogeneously taxonomic clade labels (T) are distributed across a genome’s contigs (C). It is inspired by the uncertainty coefficient [35],

$$U(T|C) = \frac{H(T) - H(T|C)}{H(T)} = 1 - \frac{H(T|C)}{H(T)},$$

where H(T) is the entropy of the taxonomic labels and H(T|C) is their conditional entropy given contig membership.

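A simple estimator for the conditional entropy H(T|C) is the plug-in estimator

$$\hat{H}(T|C) = -\sum_{c \in C} \sum_{t \in T} \frac{a_{ct}}{N} \log \frac{a_{ct}}{\sum_{t' \in T} a_{ct'}},$$

where C is the set of contigs, T is the set of taxonomic clades, $a_{ct}$ is the number of genes located in contig c and assigned to taxonomic clade t, and N is the total number of genes in the genome. Terms with $a_{ct} = 0$ contribute zero to the sum.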
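As an illustration, the plug-in estimate of the conditional entropy can be sketched in a few lines of Python. This is a minimal sketch and not GUNC's own code: the function name `conditional_entropy` and the input layout (one `(contig, clade)` pair per gene) are assumptions made for the example.

```python
import math
from collections import Counter


def conditional_entropy(assignments):
    """Plug-in estimate of H(T|C), in nats.

    `assignments` holds one (contig, clade) pair per gene
    (hypothetical input layout chosen for this sketch).
    """
    a_ct = Counter(assignments)                # genes per (contig, clade)
    a_c = Counter(c for c, _ in assignments)   # genes per contig
    n = len(assignments)                       # total number of genes, N
    h = 0.0
    for (contig, _clade), count in a_ct.items():
        # Each term is -(a_ct / N) * log(a_ct / a_c); zero counts never appear.
        h -= (count / n) * math.log(count / a_c[contig])
    return h


# Two contigs, each carrying a single clade: no uncertainty within contigs.
pure = [("c1", "A")] * 4 + [("c2", "B")] * 4
# Two contigs, each split evenly between two clades.
mixed = [("c1", "A"), ("c1", "B"), ("c2", "A"), ("c2", "B")]
print(conditional_entropy(pure), conditional_entropy(mixed))
```

For the taxonomically pure toy genome the estimate is zero, and for the evenly mixed one it equals log 2.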

However, this estimator is known to be biased when the number of samples is small [47] and adjusting it for chance leads to more interpretable quantities [48]. In our case, the sums range over the genes in each contig, and, in fragmented genomes, many contigs can contain only a small number of genes. Therefore, we normalize the estimated conditional entropy by the expected value of this estimation under a null model, leading to CSS = 1 − Ĥ(T|C)/Ĥ(T|R), where Ĥ(T|R) is the expected value of Ĥ(T|C) keeping the same contig size distribution and assuming no relationship between contig membership and taxonomic assignment (in the special case where Ĥ(T|C) > Ĥ(T|R), we set CSS to zero).
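The normalization can be sketched as follows. The expected value under the null model is approximated here by shuffling clade labels across genes, which keeps every contig's size fixed while breaking any link between contig membership and taxonomy. The Monte Carlo approach, the function names, the `n_shuffles` and `seed` parameters, and the handling of a zero null expectation are all assumptions of this sketch. The source does not state how GUNC obtains Ĥ(T|R).

```python
import math
import random
from collections import Counter


def conditional_entropy(contigs, clades):
    """Plug-in estimate of H(T|C) from parallel per-gene lists."""
    a_ct = Counter(zip(contigs, clades))
    a_c = Counter(contigs)
    n = len(contigs)
    return -sum((k / n) * math.log(k / a_c[c]) for (c, _), k in a_ct.items())


def clade_separation_score(contigs, clades, n_shuffles=200, seed=0):
    """CSS = 1 - H(T|C) / H(T|R), with H(T|R) estimated by permutation."""
    rng = random.Random(seed)
    h_tc = conditional_entropy(contigs, clades)
    shuffled = list(clades)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # contig sizes stay fixed, labels are reassigned
        total += conditional_entropy(contigs, shuffled)
    h_tr = total / n_shuffles
    # Assumption: a null expectation of zero (e.g. a single clade) gives CSS 0.
    if h_tr == 0.0 or h_tc > h_tr:
        return 0.0
    return 1.0 - h_tc / h_tr


contigs = ["c1"] * 20 + ["c2"] * 20
chimeric = ["A"] * 20 + ["B"] * 20   # labels follow contig boundaries
uniform = ["A", "B"] * 20            # same clade mix in both contigs
print(clade_separation_score(contigs, chimeric))
print(clade_separation_score(contigs, uniform))
```

In this toy case the chimeric genome scores 1 and the uniformly mixed genome scores 0, matching the two extremes of the CSS.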

The CSS is 0 if the frequency distribution of taxonomic labels in every individual contig exactly follows that across the entire genome. It is 1 if all contigs are “taxonomically pure,” i.e., if the distribution of taxonomic labels follows contig boundaries. GUNC outputs CSS scores for every tested taxonomic level, so that users can infer the approximate phylogenetic depth at which source lineages diverged. By default, GUNC sets the CSS to 0, separately at each level, when the portion of called genes left after removal of minor clades (the genes retained index) is below 0.4, because too few genes then remain to calculate scores at that level. GUNC then flags a genome as putatively contaminated if this “adjusted” CSS exceeds 0.45 at any taxonomic level, a threshold benchmarked in a series of simulation scenarios (Additional file 1: Figure S12).
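The decision rule described above can be written down compactly. The sketch below only uses the two thresholds given in the text (0.4 and 0.45). The function names and the dictionary layout are hypothetical and do not reflect GUNC's actual interface.

```python
def adjusted_css(css, genes_retained_index, min_retained=0.4):
    """Set the CSS to 0 when too few genes remain at this level."""
    return 0.0 if genes_retained_index < min_retained else css


def is_putatively_contaminated(per_level, css_threshold=0.45):
    """Flag a genome if the adjusted CSS exceeds the threshold at any level.

    `per_level` maps a taxonomic level to (css, genes_retained_index);
    this layout is an assumption made for the example.
    """
    return any(
        adjusted_css(css, retained) > css_threshold
        for css, retained in per_level.values()
    )


scores = {"phylum": (0.90, 0.30), "genus": (0.50, 0.80)}
print(is_putatively_contaminated(scores))
```

Here the phylum-level CSS is discarded because only 30% of genes are retained, but the genus-level score of 0.50 still triggers the flag.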

The CSS does not carry information about the scale of contamination (i.e., the fraction of contaminant genome), but about the confidence with which a query genome may be considered chimeric. In other words, the CSS assesses whether a genome is contaminated or not, but not how large the contaminant fraction is. GUNC instead quantifies the scale of contamination at each tested taxonomic level using two measures. The total fraction of genes with minority clade labels after filtering (“GUNC contamination”) is an estimate of the total fraction of contamination in the query genome. Note that this definition differs from that commonly used by tools such as CheckM: designed to quantify non-redundant contamination, GUNC scales by the total query genome size, whereas CheckM estimates redundant contamination by scaling against a theoretical “clean” source genome with a single set of SCGs. In practice, this means that GUNC contamination never exceeds 100%, whereas CheckM contamination estimates the number of (complete) surplus genomes. GUNC further provides a combined estimate of redundant and non-redundant contamination as the effective number of surplus clades (Teff) in a query genome, calculated as the Inverse Simpson Index minus 1 (as 1 genome is expected):

$$T_{\mathrm{eff}} = \frac{1}{\sum_i p_i^2} - 1,$$

where $p_i$ is the fraction of genes assigned to clade i. $T_{\mathrm{eff}}$ scales in [0, ∞) and can be interpreted as the number of surplus clades in the query genome considering the weighted contributions of all source lineages.
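Both measures of scale follow directly from the per-clade gene fractions. The sketch below assumes that the clade labels have already been filtered for noise and that every clade other than the largest counts as a minority clade. The function names are hypothetical.

```python
from collections import Counter


def clade_fractions(clade_labels):
    """Fraction of genes assigned to each clade (p_i)."""
    counts = Counter(clade_labels)
    n = sum(counts.values())
    return {clade: k / n for clade, k in counts.items()}


def gunc_contamination(fractions):
    """Fraction of genes outside the majority clade (assumed definition)."""
    return 1.0 - max(fractions.values())


def effective_surplus_clades(fractions):
    """Inverse Simpson index minus 1, since one genome is expected."""
    return 1.0 / sum(p * p for p in fractions.values()) - 1.0


# A genome whose genes split evenly between two clades.
p = clade_fractions(["A"] * 50 + ["B"] * 50)
print(gunc_contamination(p), effective_surplus_clades(p))
```

For an even two-clade split the contamination is 0.5 and the effective number of surplus clades is 1, whereas a single-clade genome gives 0 for both.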

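Finally, GUNC computes a reference representation score (RRS) based on the total fraction of genes mapping to the GUNC database ($\mathrm{Portion}_{\text{genes mapped}}$), the fraction of genes retained after noise filtering by removing minority labels recruiting ≤ 2% of genes (see above; $\mathrm{Portion}_{\text{genes retained}}$) and their average similarity to the reference ($\mathrm{Identity}_{\text{mean}}$):

$$\mathrm{RRS} = \mathrm{Portion}_{\text{genes mapped}} \times \mathrm{Portion}_{\text{genes retained}} \times \mathrm{Identity}_{\text{mean}}$$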

The RRS captures the expectation that out-of-reference genomes will map to the reference to a lower degree ($\mathrm{Portion}_{\text{genes mapped}}$) and at lower similarity ($\mathrm{Identity}_{\text{mean}}$). Moreover, among simulated out-of-reference genomes, we empirically observed a characteristic pattern of noisy, low-confidence hits scattered unspecifically across multiple clades at very low frequencies; in the RRS, this signature is formalized as the term $\mathrm{Portion}_{\text{genes retained}}$. High RRS values indicate that a query genome maps well within the GUNC reference space, whereas low RRS values indicate poor reference representation and should qualify the interpretation of CSS and contamination estimates. In general, the lower the RRS, the higher the risk of type 1 errors based on CSS (falsely labelling genomes as contaminated). In this way, GUNC ensures that genome quality is only confidently estimated where sufficient data are available and that genomes potentially representing deeply branching novel lineages beyond the GUNC reference are flagged for further (manual) inspection.
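A small numerical sketch shows how the three terms combine. It assumes that the RRS is the product of the three terms, that the mapped portion is taken relative to all called genes, that the retained portion is taken relative to mapped genes, and that identity is expressed as a fraction between 0 and 1. The function name and its parameters are hypothetical.

```python
def reference_representation_score(n_called, n_mapped, n_retained, mean_identity):
    """RRS as the product of mapped portion, retained portion and identity.

    Assumed denominators: mapped / called and retained / mapped.
    `mean_identity` is a fraction in [0, 1].
    """
    if n_called == 0 or n_mapped == 0:
        return 0.0  # nothing mapped: no reference representation
    portion_mapped = n_mapped / n_called
    portion_retained = n_retained / n_mapped
    return portion_mapped * portion_retained * mean_identity


# 1000 called genes, 800 mapped, 600 retained after filtering, 90% identity.
rrs = reference_representation_score(1000, 800, 600, 0.90)
print(round(rrs, 2))
```

Because the terms multiply, a low value in any one of them (few mapped genes, many noisy hits removed, or low identity) is enough to pull the score down.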
