Paired‐end Illumina MiSeq reads corresponding to the 42 sediment samples before and after heating were prepared by trimming and filtering demultiplexed fastq files. ASVs, defined as amplicon sequences that are identical to each other (no mismatches), were then inferred from the paired‐end raw reads using the open‐source Divisive Amplicon Denoising Algorithm 2 (DADA2) bioinformatics pipeline (Callahan et al., 2016). ASV data were processed and analysed in R version 4.2.1 (RStudio Team, 2020). A Shapiro–Wilk test showed that ASV sequence counts and endospore abundances were not normally distributed across the data sets, so all subsequent analysis used non‐parametric statistical methods, implemented with the R packages phyloseq (McMurdie & Holmes, 2013), ggplot2 (Wickham, 2016) and vegan (Oksanen et al., 2014).
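The normality check that motivates the non‐parametric approach can be illustrated with a short sketch. This is not the authors' code (the study used R); it is a minimal Python equivalent using `scipy.stats.shapiro`, with simulated right‐skewed counts standing in for real ASV count data:

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in for per-sample ASV counts: heavily right-skewed,
# as sequencing count data typically are (not real data from the study).
rng = np.random.default_rng(42)
asv_counts = rng.exponential(scale=100, size=42)

# Shapiro-Wilk test of the null hypothesis that the data are normal.
stat, p = stats.shapiro(asv_counts)

if p < 0.05:
    print("Normality rejected -> use non-parametric tests")
else:
    print("No evidence against normality")
```

A small p-value here justifies switching to rank‐based, non‐parametric statistics for all downstream comparisons.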
Due to the compositional nature of high‐throughput sequencing data, the absolute abundance of DNA molecules originally existing in the environment cannot be determined using nucleic acid sequencing (Gloor et al., 2017). Numbers of sequence reads therefore represent the proportion of a given ASV within a sample. Differences in total counts observed or sample read depth can influence the proportion of ASVs per sample and lead to spurious associations. Errors introduced by differences in read depth can be mitigated by applying ‘rarefaction’, i.e., subsampling read counts to a common read depth (Lozupone et al., 2011; Wong et al., 2016). Disadvantages of rarefaction include substantial loss of information (McMurdie & Holmes, 2014) and removal of rare taxa. Rarefaction of the data set obtained here would result in >60% of sequences being disregarded. Therefore, to identify novel, often rare, Firmicutes sequences in the heated sediment incubations, ASV counts were determined from the raw read data and plotted as centred log ratios (CLR; Gloor et al., 2017), a useful alternative to rarefaction (Aitchison, 1982). A caveat of CLR is that information on the precision of the data is lost during log transformation; however, the ratio remains the same irrespective of whether the data came from a large or small number of reads in a given sample library (Gloor et al., 2017). Zero read count values in the data set were replaced with the calculated median of the sample raw reads, which operates as a pseudo‐count prior to log transformation. Thus, instead of assigning an arbitrary value of one or zero counts, the pseudo‐count method is modelled on the variability in each sample (Kaul et al., 2017).
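The CLR transformation with a median pseudo‐count can be sketched as follows. This is an illustrative Python implementation under stated assumptions, not the authors' code: the function name is hypothetical, and the pseudo‐count is taken here as the median of the sample's non‐zero counts (the text does not specify whether zeros are included in that median):

```python
import numpy as np

def clr_with_median_pseudocount(counts):
    """Centred log-ratio transform of one sample's ASV count vector.

    Zeros are replaced with the median of the sample's non-zero raw
    counts (assumption: zeros excluded from the median), acting as a
    data-driven pseudo-count before log transformation.
    """
    counts = np.asarray(counts, dtype=float)
    pseudo = np.median(counts[counts > 0])          # sample-specific pseudo-count
    filled = np.where(counts == 0, pseudo, counts)  # replace zeros
    gmean = np.exp(np.mean(np.log(filled)))         # geometric mean of the sample
    return np.log(filled / gmean)                   # CLR: log of each count / g-mean

# Example: a toy ASV count vector with one zero.
clr_values = clr_with_median_pseudocount([10, 0, 5, 25])
```

By construction, CLR values within a sample sum to zero, which is why the transform depends only on ratios between ASVs and not on the sample's total read depth.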