We measured sequencing quality using five metrics: the number of reads obtained from a sample, GC content, Shannon’s entropy of k-mers, post-PCR Qubit score, and recorded DNA concentration before PCR. The number of reads in each sample was counted both before and after quality control; we used the number of reads after quality control for our results, though the difference was slight. GC content was estimated from 100,000 reads in each sample after low-quality DNA and human reads had been removed. Shannon’s entropy of k-mers was estimated from 10,000 reads taken from each sample. PCR Qubit score and DNA concentration are described in the wet lab methods.
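As an illustration of the GC content metric, the sketch below computes the fraction of G and C bases over a collection of reads. The function name `gc_fraction` is hypothetical, and the choice to ignore ambiguous bases such as N is an assumption; the text does not state how such bases were handled or which tool was used.

```python
def gc_fraction(reads):
    """Fraction of G and C among unambiguous (A, C, G, T) bases.

    Assumption: ambiguous bases (e.g. N) are excluded from the denominator.
    """
    gc = 0
    total = 0
    for read in reads:
        seq = read.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    # Return 0.0 when no unambiguous bases were seen, to avoid dividing by zero.
    return gc / total if total else 0.0
```

In practice the input would be the subsample of quality-controlled, non-human reads described above.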

We observed good separation of negative and positive controls based on both PCR Qubit and k-mer entropy. Distributions of DNA concentration and the number of reads were as expected (Figure S2G, H, I). GC content was broadly distributed for negative controls, while positive controls were tightly clustered, as expected since positive controls have a consistent taxonomic profile. Comparing the number of reads before and after quality control did not reveal any major outliers.

Batch effects are a major concern for this low-biomass study and any large-scale study. The median flowcell used in our study contained samples from 3 cities and 2 continents. However, two flowcells covered 18 cities from 5 or 6 continents respectively. When samples from these flowcells were plotted using UMAP (see "Global diversity varies according to key covariates" for details), the major global trends we described were recapitulated (Figure S2F). Plots of the number of reads against region (Figure S2G) showed a stable distribution of reads across cities. Analogous plots of PCR Qubit scores were less stable than the number of reads but showed a clear drop for control samples (Figure S2H). These results led us to conclude that batch effects are likely to be minimal.

We used BLASTn to align nucleotide assemblies from case samples to control samples. We used a threshold of 8,000 base pairs and 99.99% identity as a minimum to consider two sequences homologous. This threshold was chosen to be sensitive without solely capturing conserved regions. We identified all connected groups of homologous sequences and found approximate taxonomic identifications by aligning contigs to NCBI-NT using BLASTn, searching for 90% nucleotide identity over half the length of the longest contig in each group.
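The filtering and grouping step can be sketched as follows. This is a minimal illustration, not the pipeline actually used: it assumes each alignment has already been reduced to a hypothetical `(query, subject, percent_identity, alignment_length)` tuple, and it treats "connected groups" as connected components of the graph whose edges are alignments passing both thresholds, found here with a small union-find. The function name `homologous_groups` is also hypothetical.

```python
from collections import defaultdict

MIN_LENGTH = 8000      # minimum alignment length in base pairs (from the text)
MIN_IDENTITY = 99.99   # minimum percent identity (from the text)


def homologous_groups(hits, min_length=MIN_LENGTH, min_identity=MIN_IDENTITY):
    """Group contigs linked by alignments that pass both thresholds.

    `hits` is an iterable of (query, subject, percent_identity, length)
    tuples; this record layout is an assumption made for illustration.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity, length in hits:
        if length >= min_length and identity >= min_identity:
            parent[find(query)] = find(subject)

    groups = defaultdict(set)
    for contig in list(parent):
        groups[find(contig)].add(contig)
    # Sorted output so results are deterministic.
    return sorted(sorted(group) for group in groups.values())
```

For example, two case contigs that each align to the same control contig above the thresholds end up in one group, while alignments that are too short or too divergent contribute nothing.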

Despite good separation of positive and negative controls (see STAR Methods), we identified several species in our negative controls which were also identified as prominent taxa in the dataset as a whole (see "A core urban microbiome centers global diversity"). Our dilemma was that a microbial species that is common in the urban environment might also reasonably be expected to be common in the lab environment. In general, negative controls had lower k-mer complexity, fewer reads, and lower post-PCR Qubit scores than case samples, and no major flowcell-specific species were observed. Similarly, positive control samples were not heavily contaminated. These results suggest samples are high quality but do not systematically exclude the possibility of contamination.

Previous studies have reported that microbial species whose relative abundance is negatively correlated with DNA concentration may be contaminants. We observed a number of species that were negatively correlated with DNA concentration, but the distribution of these correlations followed the same shape as a null distribution obtained from uniformly randomly generated relative abundances, leading us to conclude that the negative correlation may simply be a statistical artifact.
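One way such a null distribution could be built is sketched below. The details are assumptions, since the text does not specify them: random values are drawn uniformly for each species in each sample, normalized so each sample sums to 1, and each simulated species is then correlated with DNA concentration using Pearson correlation. The names `pearson` and `null_correlations`, the default of 200 simulated species, and the choice of Pearson correlation are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def null_correlations(dna_concentration, n_species=200, seed=0):
    """Correlations between random relative abundances and DNA concentration.

    Assumption: one uniform random draw per species per sample, normalized
    so that each sample's abundances sum to 1.
    """
    rng = random.Random(seed)
    profiles = []
    for _ in dna_concentration:
        raw = [rng.random() for _ in range(n_species)]
        total = sum(raw)
        profiles.append([value / total for value in raw])
    return [
        pearson([profile[j] for profile in profiles], dna_concentration)
        for j in range(n_species)
    ]
```

The observed per-species correlations could then be compared against the list returned by `null_correlations`, for instance by overlaying histograms.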

We analyzed the total complexity of case samples in comparison to control samples. Case samples had a significantly higher taxonomic diversity (Figure S2I) than any type of negative control sample. We also compared the confidence of taxonomic assignments in case samples to that in controls for prominent taxa, using the number of unique marker k-mers as the measure of confidence. We found that case samples had more and higher-quality assignments than could be found in controls. In contrast, the taxonomic assignment of one species, Bradyrhizobium sp. BTAi1, was not clearly more accurate in case samples than controls. Nevertheless, we were able to assemble genomes for this species in several unique samples, so we feel the species is not definitively a negative control contaminant.

Finally, we compared assemblies from negative controls to assemblies from our case samples, searching for regions of high similarity that could be from an identical microbial strain. We reasoned that uncontaminated samples may contain the same species as negative controls but were less likely to contain identical strains. Only 137 case samples were observed to have any sequence with high similarity to an assembled sequence from a negative control (a minimum of 8,000 base pairs at 99.99% identity). The identified sequences were principally from Bradyrhizobium and Cutibacterium. Since these genera are core taxa (see "A core urban microbiome centers global diversity") observed in nearly every sample, but high similarity was only identified in a few samples, we elected not to remove species from these genera from case samples.

We generated 31-mer profiles for raw reads using Jellyfish. All k-mers that occurred at least twice in a given sample were retained. We also generated MASH sketches from the non-human reads of each sample with 10 million unique minimizers per sketch. We calculated the Shannon’s entropy of k-mers by sampling 31-mers from a uniform subsample of 10,000 reads per sample.
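The entropy calculation can be illustrated with a short sketch. It assumes reads are already subsampled, counts every overlapping k-mer in each read, and uses base-2 logarithms; the log base, the lack of canonicalization (merging a k-mer with its reverse complement), and the function name `kmer_entropy` are assumptions, not details taken from the text.

```python
import math
from collections import Counter


def kmer_entropy(reads, k=31):
    """Shannon entropy (in bits) of the k-mer frequency distribution."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        # Count every overlapping k-mer; reads shorter than k contribute none.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A sample dominated by a few repeated k-mers scores near zero, while a sample with many distinct, evenly represented k-mers scores high, which is why this metric separates low-complexity negative controls from case samples.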

We found clear correlations between k-mer based Jaccard distance (MASH) and taxonomic Jaccard distance (Figure S2A). We also compared two alpha diversity metrics (Figure S2B): Shannon entropy of k-mers and Shannon entropy of taxonomic profiles. As with pairwise distances, these metrics were correlated, though noise was present. This noise may reflect sub-species taxonomic variation in our samples.
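For reference, the Jaccard distance between two samples can be computed on any pair of sets, whether of k-mers or of taxa. The sketch below is the exact set-based definition, not the sketch-based estimate that a tool such as MASH produces; the function name `jaccard_distance` and the convention that two empty sets have distance 0 are assumptions.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumed convention: two empty samples are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)
```

Applying the same function to the k-mer set and to the taxon set of each pair of samples yields the two distances being compared.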

A large proportion of the reads in our samples were not mapped to any reference sequence. There are three major reasons why a fragment of DNA would not be classified in our analysis: (1) the DNA originated from a non-human and non-microbial species which would not be present in the databases we used for classification; (2) our classifier (KrakenUniq) failed to classify a DNA fragment that was in the database due to slight mismatch; (3) the DNA fragment is not represented in any existing database. Explanations (1) and (2) are essentially drawbacks of the database and computational model used, and we can quantify them by mapping reads to a larger database using a more sensitive aligner, such as BLASTn (Altschul et al., 1990), or by using ensemble methods for analysis (McIntyre et al., 2017). To estimate the proportion of reads which could be assigned, we took 10k read subsets from each sample and mapped these to a set of large databases using BLASTn (see "A core urban microbiome centers global diversity" for details). This resulted in 34.6% of reads which could not be mapped to any external database compared to 41.3% of reads mapped using our approach with KrakenUniq. We note that our approach to estimate the fraction of reads that could be classified using BLASTn does not account for hits to low-quality taxa which would ultimately be discarded in our pipeline, and so represents a worst-case comparison. Explanation (3) is altogether more interesting, and we refer to this DNA as true unclassified DNA. In this analysis we do not seek to quantify the origins of true unclassified DNA except to postulate that it may derive from previously unknown species, as have been identified in other similar studies.
