For the analysis, 12,602 genome sequences labeled as either Escherichia or Shigella were downloaded from GenBank on 26 June 2018 using Batch Entrez and the list of GCA accession numbers from the NCBI Genome database (including plasmid sequences when applicable). This dataset (Supplementary Data 1) was cleaned to obtain an informative set of 10,667 E. coli and Shigella genomes that captures the diversity of the species as sequenced to date. In addition to the GenBank genomes, 125,771 read sets labeled as either E. coli or Shigella were downloaded from the SRA database. After cleaning the dataset, we used Mash (ref. 21), a program that approximates the nucleotide-level similarity between two genomes, and an in-house Python script to build a pairwise distance matrix for all 10,667 genomes. The Mash distances were then converted to Pearson's correlation coefficient distances and the matrix was clustered hierarchically, so that the clustering reflected each genome's overall similarity to the whole species.
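
The conversion from Mash distances to a correlation-based distance, followed by hierarchical clustering, can be sketched as below. This is a minimal illustration and not the in-house script: the toy matrix values, the function name `pearson_distance`, the choice of `1 - r` as the distance, and the average-linkage method are all assumptions, since the text does not specify them.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def pearson_distance(mash: np.ndarray) -> np.ndarray:
    """Turn a square Mash distance matrix into 1 - Pearson r between rows.

    Each row is one genome's distance profile against every genome, so two
    genomes end up close when they relate to the whole set in the same way.
    """
    corr = np.corrcoef(mash)              # correlation between row profiles
    dist = np.clip(1.0 - corr, 0.0, 2.0)  # assumed distance definition
    np.fill_diagonal(dist, 0.0)           # remove floating-point noise
    return (dist + dist.T) / 2.0          # enforce exact symmetry


# Toy 4-genome Mash matrix (made-up values): 0 and 1 are close, 2 and 3 are close.
mash = np.array([
    [0.00, 0.01, 0.05, 0.06],
    [0.01, 0.00, 0.05, 0.05],
    [0.05, 0.05, 0.00, 0.01],
    [0.06, 0.05, 0.01, 0.00],
])

dist = pearson_distance(mash)
condensed = squareform(dist, checks=False)          # SciPy expects condensed form
tree = linkage(condensed, method="average")         # linkage method is an assumption
labels = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into two groups
```

Because each genome is compared through its full distance profile rather than a single pairwise value, the grouping depends on how a genome relates to the entire collection, which matches the stated motivation for the conversion.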

To evaluate the quality of the dataset, sequence quality scores were calculated as described by Land et al. (ref. 44). Following the recommended cutoff of 0.8, the dataset was filtered to retain only genomes with a total quality score of 0.8 or higher. Applying the same cutoff to the sequence quality score alone produced an extremely restricted dataset that no longer addressed the goals of this study. Genome size was restricted to >3 Mb and <6.77 Mb to remove genomes of questionable size, which could result from contamination or from modified genomes that are not representative of natural E. coli. After these two steps, 10,855 genomes remained in the assembled genome dataset.
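
The two filters described above (total quality score and genome size) amount to a simple predicate over per-assembly metadata. The sketch below uses hypothetical record fields (`accession`, `total_quality`, `size_bp`) and toy accessions, and assumes 1 Mb = 1,000,000 bp; it does not compute the quality scores themselves.

```python
from dataclasses import dataclass

MIN_TOTAL_QUALITY = 0.8   # cutoff stated in the text
MIN_SIZE_BP = 3_000_000   # >3 Mb (assuming 1 Mb = 10^6 bp)
MAX_SIZE_BP = 6_770_000   # <6.77 Mb


@dataclass(frozen=True)
class Assembly:
    """Hypothetical per-genome record; field names are illustrative."""
    accession: str
    total_quality: float
    size_bp: int


def passes_filters(a: Assembly) -> bool:
    """Keep genomes with total quality >= 0.8 and size strictly inside the bounds."""
    return (a.total_quality >= MIN_TOTAL_QUALITY
            and MIN_SIZE_BP < a.size_bp < MAX_SIZE_BP)


assemblies = [
    Assembly("toy_A", 0.92, 5_100_000),
    Assembly("toy_B", 0.75, 4_900_000),  # fails the quality cutoff
    Assembly("toy_C", 0.88, 2_400_000),  # too small
    Assembly("toy_D", 0.80, 6_700_000),  # exactly 0.8 is retained
]

kept = [a.accession for a in assemblies if passes_filters(a)]
```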

To further clean the dataset, we removed genomes that fell outside the statistical distribution of Mash distances within the dataset. Assuming that all Shigella species are members of E. coli, we used the type strains of the Escherichia and Shigella genera (accession numbers GCA_000613265.1 and GCA_002949675.1, respectively) to quickly screen the 10,855 genomes for erroneous or low-quality genomes that had passed the previous cleaning steps. The Mash distances of the 10,855 genomes to each type strain were divided into percentiles ranging from 10% to 99.995%. A cutoff percentile of 98.5% was judged to provide sufficient cleaning without risking a large loss of data (Supplementary Data 1) and was applied to each type strain's set of Mash values. Genomes present in both sets after filtering were retained, yielding the final dataset of 10,667 genomes.
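
A percentile cutoff applied against two reference genomes, followed by an intersection, might look like the sketch below. It assumes that genomes at or below the 98.5th percentile of distance to a type strain are the ones retained (the text does not state the direction explicitly), uses NumPy's default linear-interpolation percentile, and runs on synthetic distances with hypothetical accessions.

```python
import numpy as np

CUTOFF_PERCENTILE = 98.5  # cutoff stated in the text


def within_cutoff(distances: dict[str, float],
                  pct: float = CUTOFF_PERCENTILE) -> set[str]:
    """Return accessions whose distance to a type strain is at or below the percentile."""
    threshold = np.percentile(list(distances.values()), pct)
    return {acc for acc, d in distances.items() if d <= threshold}


# Synthetic distances for 200 toy genomes to each type strain (made-up values).
to_escherichia = {f"toy_{i}": 0.0001 * i for i in range(200)}
to_shigella = {f"toy_{i}": 0.0001 * (199 - i) for i in range(200)}

# A genome is kept only if it survives the cutoff against both type strains.
retained = within_cutoff(to_escherichia) & within_cutoff(to_shigella)
```

Taking the intersection means a genome that is an outlier relative to either type strain is dropped, which is the stricter of the two possible ways to combine the filtered sets.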
