Different tools and statistical packages were used to preprocess the generated sequence matrices before determining taxon–taxon, taxon–metadata and total counts–metadata correlations. As a reference, the same correlations were calculated from the original simulated taxonomy matrix.
Relative transformations (not addressing metagenomic data compositionality)
Relative microbiota profiling (RMP). RMP matrices were obtained by rarefying all samples in each sequence matrix to an even sequencing depth (the minimum sample total read count of the matrix). This method was used as implemented in the R package phyloseq (v1.34.0)29.
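The rarefaction step can be illustrated with a minimal Python sketch. It is not the phyloseq implementation: the function name, the samples-by-taxa layout (rows are samples) and the fixed seed are assumptions made for illustration, and reads are drawn without replacement.

```python
import numpy as np

def rarefy_to_min_depth(counts, seed=0):
    """Subsample every sample (row) without replacement to the smallest row total.

    Hypothetical helper: assumes an integer matrix with samples in rows.
    """
    counts = np.asarray(counts, dtype=np.int64)
    rng = np.random.default_rng(seed)
    depth = int(counts.sum(axis=1).min())  # minimum total read count in the matrix
    rarefied = np.zeros_like(counts)
    for i, row in enumerate(counts):
        # Draw `depth` reads from the sample without replacement
        rarefied[i] = rng.multivariate_hypergeometric(row, depth)
    return rarefied
```

After rarefaction every row sums to the same depth, and the sample that defined the minimum is returned unchanged.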
Relative proportions (Rel). Absolute counts from metagenomic sequencing were converted to relative proportions by dividing each taxon abundance by the total taxa abundance in a sample.
Total sum scaling and arcsine square-root transformation (AST). In this method, each taxon count is first scaled by dividing it by the corresponding sample total count (total sum scaling, TSS), and the arcsine of the square root of the scaled values is then computed23. The method, which has been used in several microbiome research publications15,30, was applied with a custom implementation. Since it is a direct transformation of the relative proportions (Rel), this method was classified as a relative transformation.
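As an illustration, the two steps can be written in a few lines of Python. This is a sketch, not the custom implementation used in the study. It assumes that AST denotes the arcsine of the square root of the proportions, as is common in the microbiome literature, and that samples are stored in rows. The function name is hypothetical.

```python
import numpy as np

def arcsine_sqrt_transform(counts):
    """Total sum scaling followed by arcsin(sqrt(p)) (samples in rows; assumed layout)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    proportions = counts / totals  # TSS; equivalent to the Rel transformation
    return np.arcsin(np.sqrt(proportions))
```

The transformed values lie between 0 (absent taxon) and π/2 (a taxon accounting for the whole sample).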
Compositional transformations (computational approaches to bypass data compositionality)
Centered log-ratio transformation (CLR). In this approach, the log ratio of each taxon's count to the geometric mean of all taxa in a sample is computed. Prior to the transformation, zeros in the sequence matrix were imputed by Bayesian multiplicative replacement (implemented in the R package zCompositions (v1.3.4)31). This method was used as implemented in the CoDaSeq R package (v0.99.6)6.
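The transformation itself can be sketched as follows in Python. This is an illustration only, not the CoDaSeq code: the function name and the samples-in-rows layout are assumptions, and zero replacement is deliberately left to the caller, so the input must already be strictly positive.

```python
import numpy as np

def clr(matrix):
    """Centered log-ratio per sample (row); expects a zero-free matrix."""
    x = np.asarray(matrix, dtype=float)
    if np.any(x <= 0):
        raise ValueError("CLR needs strictly positive values; replace zeros first")
    log_x = np.log(x)
    # Subtracting the mean log equals dividing by the geometric mean
    return log_x - log_x.mean(axis=1, keepdims=True)
```

Each transformed sample sums to zero, which is a convenient sanity check on the output.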
Cumulative sum scaling (CSS). In this method, taxon counts are divided by the cumulative sum of counts of each sample, up to a percentile determined ad hoc for each dataset, based on the data distribution. The method was used as implemented in the R package metagenomeSeq (v1.32.0)32.
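A simplified Python sketch of the scaling step is given below. It is not the metagenomeSeq code. The percentile is taken as a user-supplied parameter instead of being estimated from the data, the multiplicative constant `scale` is an arbitrary choice, and every sample is assumed to contain at least one non-zero count. Names and the samples-in-rows layout are hypothetical.

```python
import numpy as np

def css_normalize(counts, percentile=0.5, scale=1000.0):
    """Divide each sample (row) by the sum of its counts up to a chosen quantile."""
    counts = np.asarray(counts, dtype=float)
    normalized = np.empty_like(counts)
    for i, row in enumerate(counts):
        # Quantile of the non-zero counts of this sample
        threshold = np.quantile(row[row > 0], percentile)
        # Cumulative sum of all counts not exceeding the threshold
        factor = row[row <= threshold].sum()
        normalized[i] = row / factor * scale
    return normalized
```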
Geometric mean of pairwise ratios (GMPR). This method calculates a scaling factor for each sample. For every pair of samples, it first computes the median of the taxon-wise count ratios, using only taxa with non-zero counts in both samples. The scaling factor of a sample is then calculated as the geometric mean of the medians obtained between that sample and all of the other samples in the dataset. The method was used as implemented in the GMPR R package (v0.1.3)33.
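The calculation described above can be sketched in Python as follows. This is an illustrative reading of the description, not the GMPR package code, which may differ in details such as a minimum number of shared taxa. The function name and the samples-in-rows layout are assumptions.

```python
import numpy as np

def gmpr_size_factors(counts):
    """Geometric mean of pairwise median ratios for each sample (row)."""
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.shape[0]
    factors = np.full(n_samples, np.nan)
    for i in range(n_samples):
        medians = []
        for j in range(n_samples):
            if i == j:
                continue
            shared = (counts[i] > 0) & (counts[j] > 0)  # taxa non-zero in both samples
            if shared.any():
                medians.append(np.median(counts[i, shared] / counts[j, shared]))
        if medians:
            factors[i] = np.exp(np.mean(np.log(medians)))  # geometric mean
    return factors
```

Samples that share no taxa with any other sample are left as NaN in this sketch.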
Trimmed mean of M-values (TMM). In this method, M values are defined as the log-ratio between the relative abundance of each gene (or taxon) g in a given sample and in a reference sample. The reference is the sample whose upper quartile is closest to the mean upper quartile of all samples. For each non-reference sample, the M values of all genes/taxa are calculated and the extreme values are trimmed. The mean of the remaining M values is used as the scaling factor for normalization34. The method was used as implemented in the edgeR package (v3.32.1)13.
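A deliberately simplified Python sketch of these steps follows. It is not the edgeR implementation. It uses an unweighted mean and trims only on the M values. The trimming fraction, the function name, the optional `ref` argument and the samples-in-rows layout are assumptions made for illustration. Each sample is assumed to share at least one non-zero taxon with the reference.

```python
import numpy as np

def tmm_factors(counts, trim=0.3, ref=None):
    """Simplified TMM: trimmed mean of log2 proportion ratios against a reference."""
    counts = np.asarray(counts, dtype=float)
    props = counts / counts.sum(axis=1, keepdims=True)
    if ref is None:
        # Reference: sample whose upper quartile is closest to the mean upper quartile
        uq = np.quantile(props, 0.75, axis=1)
        ref = int(np.argmin(np.abs(uq - uq.mean())))
    factors = np.ones(counts.shape[0])
    for i in range(counts.shape[0]):
        both = (counts[i] > 0) & (counts[ref] > 0)
        m = np.sort(np.log2(props[i, both] / props[ref, both]))  # M values
        k = int(np.floor(trim * m.size))  # number trimmed from each tail
        factors[i] = 2.0 ** m[k:m.size - k].mean()
    return factors, ref
```

The reference sample always receives a factor of 1, since all of its M values are zero.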
Upper quartile normalization (UQ). For this normalization, scaling factors are calculated from the 75th percentile (upper quartile) of the counts of each sample, after excluding zero taxon abundances, and scaled by sequencing depth. The method was used as implemented in the edgeR package (v3.32.1)13.
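The following Python sketch illustrates the idea. It is not the edgeR code. Zeros are excluded within each sample, which is one possible reading of the description. The factors are rescaled to a geometric mean of 1, a common convention for normalization factors. The function name and samples-in-rows layout are assumptions.

```python
import numpy as np

def upper_quartile_factors(counts):
    """75th percentile of the non-zero counts per sample (row), scaled by depth."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1)
    q75 = np.array([np.quantile(row[row > 0], 0.75) for row in counts])
    factors = q75 / depth
    # Rescale so that the factors have a geometric mean of 1
    return factors / np.exp(np.mean(np.log(factors)))
```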
Relative log expression (RLE). In this method, the geometric mean of each taxon across all samples is calculated. The median ratio of each sample to the vector of geometric means (excluding zeros) is used as the scaling factor for normalization. The method was used as implemented in the edgeR package (v3.32.1)13.
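A minimal Python sketch of the median-of-ratios calculation is shown below. It is not the edgeR code. In this sketch a taxon with a zero count in any sample has a geometric mean of zero and is therefore excluded, which is one way of reading "excluding zeros". The function name and samples-in-rows layout are assumptions.

```python
import numpy as np

def rle_size_factors(counts):
    """Median ratio of each sample (row) to the per-taxon geometric means."""
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)  # -inf where the count is zero
    log_geo_mean = log_counts.mean(axis=0)  # -inf if the taxon is zero anywhere
    usable = np.isfinite(log_geo_mean)
    log_ratios = log_counts[:, usable] - log_geo_mean[usable]
    return np.exp(np.median(log_ratios, axis=1))
```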
Variance-stabilizing transformation (VST). In this method, taxon counts are scaled by their corresponding library size factors (calculated as in RLE), and a variance-stabilizing transformation that accounts for the relationship between the dispersion and the mean is then applied. The method was used as implemented in the DESeq2 R package (v1.30.1)14.
Quantitative transformations (experimental approaches to bypass data compositionality)
Quantitative microbiota profiling (QMP). In this method, samples are first rarefied to an even sampling depth. Sampling depth, not to be confused with sequencing depth, represents the fraction of the actual microbiota in a sample that is observed. It can be defined as the ratio between the sequencing depth (here taken as the total number of sequencing reads assigned to any taxon in a sample) and the total microbial load per gram of the original sample. QMP matrices were generated by rarefying (randomly subsetting) the sequence matrices to an even sampling depth, considering their synthetic microbial loads, and then scaling them by multiplying each sample by its estimated microbial load, as implemented in the original publication4.
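The two steps can be illustrated with the following Python sketch. It is not the code of the original publication. The function name, the samples-in-rows layout, the fixed seed and the rounding of the rarefaction target are assumptions. Every sample is assumed to retain at least one read after rarefaction.

```python
import numpy as np

def qmp(counts, loads, seed=0):
    """Rarefy to even sampling depth, then scale each sample (row) to its load."""
    counts = np.asarray(counts, dtype=np.int64)
    loads = np.asarray(loads, dtype=float)
    rng = np.random.default_rng(seed)
    sampling_depth = counts.sum(axis=1) / loads  # reads per unit of microbial load
    min_depth = sampling_depth.min()
    profile = np.empty(counts.shape, dtype=float)
    for i, row in enumerate(counts):
        # Number of reads giving this sample the minimum sampling depth
        target = min(int(round(min_depth * loads[i])), int(row.sum()))
        rarefied = rng.multivariate_hypergeometric(row, target)
        profile[i] = rarefied / target * loads[i]  # proportions times load
    return profile
```

In the resulting matrix each sample sums to its microbial load, and the sample with the lowest sampling depth is only rescaled, not subsampled.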
Absolute counts scaling (ACS). The ACS matrices were derived as previously reported7,16, i.e., by directly multiplying the relative sequencing counts of each sample by their estimated microbial loads.
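In contrast to QMP, no rarefaction is involved, so the calculation reduces to a single line. The Python sketch below assumes that "relative sequencing counts" means per-sample proportions and that samples are stored in rows. The function name is hypothetical.

```python
import numpy as np

def acs(counts, loads):
    """Multiply the per-sample (row) proportions by the estimated microbial loads."""
    counts = np.asarray(counts, dtype=float)
    loads = np.asarray(loads, dtype=float)
    proportions = counts / counts.sum(axis=1, keepdims=True)
    return proportions * loads[:, None]
```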