Published: Apr 20, 2022; DOI: 10.21769/BioProtoc.4395
Edited by: Sonny Lee; Reviewed by: Damián Lobato-Márquez, Prashanth N Suravajhala
Abstract
Many research questions in plant science depend on surveys of the plant microbiome. When the question is "who is there" rather than "what are they doing," targeted meta-amplicon surveys are often the most cost-effective way to interrogate a microbiome. There are numerous platforms for processing and analyzing meta-amplicon data sets; here we look at a flexible, reproducible, and fully customizable pipeline in the R environment. This approach lets you process and analyze your data in the same setting, without moving back and forth between standalone platforms. It is also fully flexible in the analyses and visualizations you can produce, rather than being limited to the pre-selected tools available in point-and-click analysis engines, such as QIIME, Galaxy, or MG-RAST.
Keywords: Exact sequence variants
Background
It is increasingly common to find oneself needing to take a snapshot of microbial community structure in a given environment. From tracking key microbial members of the human gut, to comparing pathogen effects on crop-associated mutualistic bacteria, or testing the effect of probiotic applications in preventing disease in endangered plants, this is a common and powerful approach for gathering data (Busby et al., 2016; Zahn and Amend, 2017; Darcy et al., 2020). Whether your question is centered on the microbes or the host, you may find yourself with cost-effective, confusing, and potentially useful meta-amplicon data.
Typically, these data are obtained by sequencing either 16S (for bacteria) or ITS (for fungi) amplicons that have been tagged with sample-specific barcodes. Hundreds of samples can be sequenced simultaneously on a single Illumina flowcell in this manner, which makes the method well suited to questions that ask "who is there?" and require high replication for statistical inference. A sequencing center will typically return demultiplexed data, where each sample is in its own file. It is often at this point that a researcher realizes they don't know what to do with all of these files.
Here, I describe a workflow that (with minor adaptations) can be used for bacteria or fungi. The supplementary materials contain the complete scripts, fully documented and heavily commented, but here I describe each step in turn as we move from raw sequence files to hypothesis testing. Many of the tools and algorithms used in this workflow are quite complicated, so I have made an effort to keep explanations simple and at a surface level. More detailed explanations of each tool are best left to the associated publications.
The workflow presented here utilizes an "amplicon sequence variant" (ASV) instead of an operational taxonomic unit (OTU) approach. OTUs rely on clustering similar sequences, typically with random seeds, and a pre-defined set of similarity thresholds. ASVs take error-corrected sequences as the de facto unit of observation. The advantages of ASVs include the potential to uncover diversity that could otherwise be masked by lumping sequences together, and enabling fully reproducible results (Tipton et al., 2022). Rather than each taxonomic unit being denoted by a pseudo-random cloud of sequences, an ASV is an exact variant that can be compared across multiple studies.
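The contrast between the two units can be sketched with a toy example. This is written in Python purely for illustration (the workflow itself runs in R), and it rests on loud assumptions: the reads are invented, they are treated as already error-corrected, and the greedy clustering function with its 0.9 threshold is a hypothetical stand-in for OTU picking, not the algorithm of any real tool.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two sequences (toy measure, no alignment)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def exact_variants(reads):
    """ASV-style table: every distinct sequence is its own unit of observation."""
    return Counter(reads)


def greedy_otus(reads, threshold):
    """OTU-style table: each read joins the first centroid it matches at >= threshold.

    The first read seen in a cluster becomes its centroid, so the result
    depends on the order in which reads arrive.
    """
    centroids = []
    counts = Counter()
    for read in reads:
        for centroid in centroids:
            if identity(read, centroid) >= threshold:
                counts[centroid] += 1
                break
        else:
            # No existing centroid was similar enough: start a new cluster.
            centroids.append(read)
            counts[read] += 1
    return counts


# Hypothetical reads: two variants that differ at one base, plus an unrelated sequence.
reads = ["ACGTACGTAC"] * 5 + ["ACGTACGTAT"] * 3 + ["TTGCATGCAA"] * 2

asv_table = exact_variants(reads)
otu_table = greedy_otus(reads, threshold=0.9)

print("ASV units:", len(asv_table))
print("OTU units:", len(otu_table))
```

In this toy case the exact-variant table keeps the two near-identical sequences apart and is the same for any read order, whereas the clustered table lumps them together and labels the cluster by whichever sequence happened to come first.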
The standard approach for any meta-amplicon workflow can be broken into two main tasks: 1) turning raw data into a sequence abundance table, and 2) using that table (with associated metadata) to generate and test hypotheses about microbial communities. These tasks have very different pitfalls to be aware of. The first task is much more formulaic, with only a few parameters that typically need to be tuned for each new study. The second task is more open-ended, but carries all the usual caveats that come with data exploration and hypothesis testing. This protocol will focus on the first task of generating a sequence abundance table from raw meta-amplicon reads, and will then give an example workflow that explores alpha- and beta-diversity, and looks for differentially abundant taxa between sample groups. This example will serve as a useful demonstration of how to go about exploring the derived dataset that this workflow generates, but is specific to the example data presented here.
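To make the second task concrete, the sketch below computes one alpha-diversity measure (the Shannon index) and one beta-diversity measure (Bray-Curtis dissimilarity) from a tiny sequence abundance table. It is a Python illustration of the standard formulas only, not the code used in this workflow, which relies on R packages for these steps. The table, sample names, and counts are invented.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator


# Hypothetical abundance table: one row per sample, one column per ASV.
table = {
    "sample_A": [10, 10, 0, 0],
    "sample_B": [10, 10, 0, 0],
    "sample_C": [0, 0, 15, 5],
}

for name, counts in table.items():
    print(name, "Shannon:", round(shannon(counts), 3))

print("A vs B:", bray_curtis(table["sample_A"], table["sample_B"]))
print("A vs C:", bray_curtis(table["sample_A"], table["sample_C"]))
```

Samples with identical counts have a dissimilarity of 0, and samples that share no ASVs have a dissimilarity of 1, which is the kind of between-group contrast the example analysis goes on to test.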
Software and Data sets
Software
cutadapt (Martin, M., 2011; version 2.10; https://cutadapt.readthedocs.io/en/stable/changes.html#v2-1-2019-03-15)
ITSxpress (Rivers et al., 2018; version 1.0; https://github.com/usda-ars-gbru/itsxpress)
R (R Core Team, 2017; version 3.6.3; https://www.R-project.org/)
tidyverse R package (Wickham et al., 2019; version 1.3.0; https://www.tidyverse.org/)
DADA2 R package (Callahan et al., 2016; version 1.14.1; https://benjjneb.github.io/dada2/)
decontam R package (Davis et al., 2018; version 1.6.0; https://benjjneb.github.io/decontam/)
phyloseq R package (McMurdie and Holmes, 2013; version 1.30.0; https://joey711.github.io/phyloseq/)
Biostrings R package (Pagès et al., 2021; version 2.54.0; http://www.bioconductor.org/packages/release/bioc/html/Biostrings.html)
phangorn R package (Schliep, 2011; version 2.2.5; https://CRAN.R-project.org/package=phangorn)
msa R package (Bodenhofer et al., 2015; version 1.18.0; https://bioconductor.org/packages/release/bioc/html/msa.html)
ShortRead R package (Morgan et al., 2009; version 1.44.3; https://bioconductor.org/packages/release/bioc/html/ShortRead.html)
corncob R package (Martin, B. D. et al., 2020; version 0.2.0; https://CRAN.R-project.org/package=corncob)
vegan R package (Oksanen et al., 2016; version 2.6.0; https://CRAN.R-project.org/package=vegan)
patchwork R package (Pedersen, 2020; version 1.0.1; https://CRAN.R-project.org/package=patchwork)
Input data
Input data are the demultiplexed fastq reads from sequenced bacterial 16S amplicons. There should be two files for each experimental sample that was sequenced: one file of forward reads and one file of reverse reads. This workflow is written for the commonly used Illumina sequencing platform, but IonTorrent sequences can be used with some slight modification (noted in the workflow).
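Before processing, it can help to confirm that every sample really does have both files. The following Python sketch (illustration only; the workflow itself is in R) pairs forward and reverse files by name. The `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming pattern and the file names are assumptions, so adjust the pattern to match whatever your sequencing center returns.

```python
import re
from collections import defaultdict

# Assumed naming convention: <sample>_R1_001.fastq.gz (forward) and
# <sample>_R2_001.fastq.gz (reverse). This is a guess, not part of the protocol.
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])_001\.fastq(\.gz)?$")


def pair_reads(filenames):
    """Group files by sample; return complete pairs and samples missing a mate."""
    found = defaultdict(dict)
    for name in filenames:
        match = READ_PATTERN.match(name)
        if match:
            found[match.group("sample")][match.group("read")] = name

    pairs = {}
    unpaired = []
    for sample, reads in sorted(found.items()):
        if "1" in reads and "2" in reads:
            pairs[sample] = (reads["1"], reads["2"])
        else:
            unpaired.append(sample)
    return pairs, unpaired


# Hypothetical directory listing, including a PCR negative and one incomplete sample.
files = [
    "S01_R1_001.fastq.gz", "S01_R2_001.fastq.gz",
    "S02_R1_001.fastq.gz",
    "NEG1_R1_001.fastq.gz", "NEG1_R2_001.fastq.gz",
]

pairs, unpaired = pair_reads(files)
print("complete pairs:", sorted(pairs))
print("missing a mate:", unpaired)
```

Catching a missing mate file at this stage is cheaper than discovering it when a later paired-end step fails.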
Example data consists of custom subsets of host-associated bacterial 16S amplicons taken from 46 seagrass samples (including PCR negatives). Files have been truncated and modified from their original form to allow for faster computational times, while still showing significant differences in community composition between sampling groups. They no longer represent any real study system, but are meant to be used as a teaching tool. Raw and intermediate data are provided in the GitHub repository associated with this publication.
Procedure
Category
Microbiology > Microbe-host interactions > Bacterium
Microbiology > Community analysis > Metagenomics
Biological Sciences > Biological techniques > Microbiology techniques