Construction of artificial genomes under different scenarios

Askarbek Orakov; Anthony Fullam; Luis Pedro Coelho; Supriya Khedkar; Damian Szklarczyk; Daniel R. Mende; Thomas S. B. Schmidt; Peer Bork


This method is extracted from research article: Genome Biol, Jun 2021

GUNC: detection of chimerism and contamination in prokaryotic genomes

DOI: 10.1186/s13059-021-02393-0


Artificial genomes were constructed to simulate different scenarios of genome contamination and reference representation (see Fig. 2a). All simulations were performed using genomes in the curated and taxonomically annotated proGenomes 2.1 database [34], serving as a baseline for clean, in-reference genomes (“type 1” in Fig. 2a). Further simulation scenarios are described below. Unless otherwise indicated, simulations were conducted separately for each taxonomic level and at contamination portions of 5%, 10%, 15%, 20%, 30%, 40%, and 50%, with 3000 iterations (genomes) for each combination of taxonomic level and contamination portion. In each simulated genome, source genome contigs were randomly fragmented such that contig size was inversely proportional to contig frequency, parameterized based on the empirical frequency-size distributions of MAGs in the Pasolli, Almeida, and Nayfach datasets [13–15]. Simulated genomes were then generated from these simulated contigs based on the rules set out below:

Type 1: Clean (non-contaminated) genomes, in reference. Taken from proGenomes 2.1.

Type 2: Clean (non-contaminated) genomes, out of reference. Simulated by removing a genome’s entire source lineage from the reference.

Type 3a: Binary chimeric genome from two sources, both in reference. Simulated by randomly selecting “donor” and “acceptor” (recipient) genomes whose lineages diverged at any of the seven tested taxonomic levels (divergence levels). Either a fraction of the acceptor genome was replaced by a matching fraction of the donor genome (to simulate non-redundant contamination), or the corresponding fraction of the donor genome was added to the complete acceptor genome (to simulate redundant contamination).

Type 3b: Chimera of multiple (3, 4, or 5) source genomes, all in reference. Source genomes from different source clades were mixed in equal shares summing to 1, e.g., $\frac{1}{3}$, $\frac{1}{4}$, or $\frac{1}{5}$ each.

Type 4: Binary chimera, both source lineages out of reference at subordinate levels. Source lineage clades removed at subordinate levels (e.g., genus or family) but sister clades retained in reference within the same parent clades (e.g., class or phylum), so that both higher-level source clades were represented at divergence level. Simulated 10,000 times for each taxonomic and contamination level.

Type 5a: Binary chimera, one source lineage in reference, one out of reference at divergence level. Recipient genome (in reference) partially replaced by donor genome (out of reference at divergence level).

Type 5b: Binary chimera, both source lineages out of reference at divergence level, i.e., no genome available from the entire clades (at divergence level) containing the source genomes.
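The mixing rules for types 3a and 3b can be illustrated with a minimal Python sketch. This is not the code used in the study: the function names (`sample_portion`, `binary_chimera`, `multi_chimera`) are hypothetical, contigs are represented as plain strings, and the contamination portion is measured in number of contigs for simplicity, which is an assumption made here rather than something stated in the text. Fragmentation of source contigs and removal of lineages from the reference are not covered.

```python
import random
from typing import List, Sequence


def sample_portion(contigs: Sequence[str], portion: float,
                   rng: random.Random) -> List[str]:
    """Draw round(portion * n) contigs at random, without replacement."""
    k = round(portion * len(contigs))
    return rng.sample(list(contigs), k)


def binary_chimera(acceptor: Sequence[str], donor: Sequence[str],
                   portion: float, redundant: bool,
                   rng: random.Random) -> List[str]:
    """Mix two source genomes (type 3a style).

    redundant=True:  the donor portion is added to the complete acceptor.
    redundant=False: the same portion of the acceptor is dropped first,
                     so the donor portion replaces it.
    Assumption: 'portion' counts contigs, not bases or genes.
    """
    donor_part = sample_portion(donor, portion, rng)
    if redundant:
        return list(acceptor) + donor_part
    n_keep = len(acceptor) - round(portion * len(acceptor))
    kept = rng.sample(list(acceptor), n_keep)
    return kept + donor_part


def multi_chimera(sources: Sequence[Sequence[str]],
                  rng: random.Random) -> List[str]:
    """Mix 3, 4, or 5 source genomes in equal shares (type 3b style)."""
    share = 1.0 / len(sources)
    mixed: List[str] = []
    for contigs in sources:
        mixed.extend(sample_portion(contigs, share, rng))
    return mixed


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    acceptor = [f"A{i}" for i in range(100)]  # toy contig labels
    donor = [f"D{i}" for i in range(100)]
    non_redundant = binary_chimera(acceptor, donor, 0.2, False, rng)
    redundant = binary_chimera(acceptor, donor, 0.2, True, rng)
    print(len(non_redundant), len(redundant))  # prints: 100 120
```

In this toy run, a 20% non-redundant chimera keeps the contig count at that of the acceptor (80 acceptor plus 20 donor contigs), whereas the redundant variant grows by the donor portion. A realistic implementation would likely measure portions by sequence length or gene count instead.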

To check for potential performance bias due to the selected reference set and taxonomy, an additional round of simulations was performed in which genomes from GTDB v95 [2] were used instead of proGenomes 2.1. For this purpose, an alternative GUNC reference set based on GTDB species-representative genomes was generated. Apart from these differences, every aspect of this additional simulation was equivalent to the original simulation. We also confirmed that optimal GUNC CSS cutoff values with a GTDB reference did not differ significantly from those originally defined with a proGenomes-based reference.

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
