Aggregation and normalization methods

Anna S. E. Cuomo
Giordano Alvari
Christina B. Azodi
Davis J. McCarthy
Marc Jan Bonder

Single-cell normalization and log2 transformation of normalized counts per million (CPM) were performed on the raw single-cell counts using scran (version 1.14.1) [20], with size factors calculated using the pooled approach described in [21]. A pseudocount of one was added before the log2 transformation to avoid taking the log of zero (i.e., log2(CPM + 1)). The normalized counts were then aggregated using either the mean or the median (Fig. 1). We also considered alternative single-cell normalization techniques: bayNorm (version 1.8.0) [28] for both Smart-Seq2 and 10X, and sctransform (version 0.3.2) [29] for 10X only. We observed minimal differences in normalized expression values (as assessed by gene-level Pearson correlation over the donor-run combinations; Additional file 1: Table S4) and in eQTL results after dr-mean aggregation (as assessed by the overlap in eQTL effect sizes and p values). Sum aggregation was performed directly on the raw counts (Fig. 1), followed by pseudo-bulk-like TMM normalization [1] and log2 transformation, as implemented in edgeR (version 3.28.1) [30].
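
The difference in the order of operations between mean/median aggregation (normalize each cell, then aggregate) and sum aggregation (aggregate raw counts, then normalize) can be illustrated with a minimal Python sketch. This is not the pipeline used here: plain library-size CPM scaling stands in for both the scran pooled size factors and the edgeR TMM normalization, and the function names and toy counts are hypothetical.

```python
import math
from statistics import mean, median

def log2_cpm(counts):
    """log2(CPM + 1) for one count vector.

    Assumption: plain library-size scaling is used here as a stand-in
    for scran pooled size factors (single cells) or TMM (pseudo-bulk).
    """
    total = sum(counts)
    return [math.log2(c / total * 1e6 + 1) for c in counts]

def aggregate_normalized(cells, how="mean"):
    """Normalize each cell first, then take the per-gene mean or median."""
    agg = mean if how == "mean" else median
    normalized = [log2_cpm(cell) for cell in cells]
    # zip(*...) iterates over genes, yielding one tuple of per-cell values per gene
    return [agg(gene_values) for gene_values in zip(*normalized)]

def aggregate_sum(cells):
    """Sum raw counts per gene first, then normalize the pseudo-bulk profile."""
    summed = [sum(gene_counts) for gene_counts in zip(*cells)]
    return log2_cpm(summed)

# Toy data: three cells (rows) by three genes (columns), raw counts.
cells = [
    [10, 0, 90],
    [0, 0, 100],
    [30, 20, 50],
]

mean_profile = aggregate_normalized(cells, how="mean")
median_profile = aggregate_normalized(cells, how="median")
sum_profile = aggregate_sum(cells)
```

In this toy example the second gene is detected in only one of three cells, so its median-aggregated value is zero while the mean- and sum-aggregated values are not.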

In all cases, aggregation was performed at two batch levels (Tables 1 and 2). First, we aggregated all cells from each donor (d-mean, d-median, d-sum). In this setting, one sample corresponds to one donor (n = 87 in the iPSC Smart-Seq2 dataset, n = 174 in the FPP 10X dataset; samples with > 5 cells only). All donors in the iPSC Smart-Seq2 dataset had > 5 cells, so for that dataset the threshold only had an effect when aggregating per donor-run combination. In the 10X data, only five donors were removed by this filter (Additional file 1: Fig. S2c,d). When sequencing coverage is low or poor sequencing quality is a concern, a higher minimum cell threshold may be needed, at the cost of fewer donors and thus less power for eQTL discovery [19]. Next, we aggregated separately across donors and sequencing runs (dr-mean, dr-median, dr-sum). In this second setting, one sample is a unique donor-sequencing run combination (considering samples with > 5 cells, n = 155 in the iPSC Smart-Seq2 data and n = 702 in the FPP 10X data). Visually, the various aggregation methods give a similar picture across donors/samples and genes, with the median aggregations being most affected by the zero-inflated expression (as shown for the iPSC Smart-Seq2 data in Additional file 1: Fig. S14 and Fig. S15).
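
The two aggregation levels and the > 5 cells filter can be sketched as follows. This is an illustrative Python sketch only: the record layout (`donor`, `run`, `expr`), the function names, and the toy data are assumptions, and mean aggregation is shown as one of the three options.

```python
from collections import defaultdict
from statistics import mean

MIN_CELLS = 5  # a sample is kept only if it has more than this many cells

def group_cells(cells, by_run=False):
    """Group per-cell expression vectors by donor, or by (donor, run)."""
    groups = defaultdict(list)
    for cell in cells:
        key = (cell["donor"], cell["run"]) if by_run else cell["donor"]
        groups[key].append(cell["expr"])
    # Drop samples that do not pass the minimum cell threshold.
    return {key: vecs for key, vecs in groups.items() if len(vecs) > MIN_CELLS}

def aggregate_mean(groups):
    """Per-gene mean within each sample (d-mean or dr-mean, depending on grouping)."""
    return {key: [mean(g) for g in zip(*vecs)] for key, vecs in groups.items()}

def make_cells(donor, run, n, expr):
    """Hypothetical helper that builds n identical toy cells."""
    return [{"donor": donor, "run": run, "expr": list(expr)} for _ in range(n)]

# Toy data: donor A has 4 cells in each of two runs, donor B has 6 cells
# in one run, donor C has 3 cells in one run.
cells = (
    make_cells("A", "r1", 4, [1.0, 2.0])
    + make_cells("A", "r2", 4, [3.0, 2.0])
    + make_cells("B", "r1", 6, [3.0, 0.0])
    + make_cells("C", "r1", 3, [5.0, 5.0])
)

d_mean = aggregate_mean(group_cells(cells, by_run=False))
dr_mean = aggregate_mean(group_cells(cells, by_run=True))
```

In the toy data, donor A passes the threshold at the donor level (8 cells) but neither of its donor-run combinations does (4 cells each), which illustrates why the same threshold can remove more samples at the donor-run level.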
