Single-cell normalization and log2 transformation of normalized counts per million were performed on raw single-cell counts using scran (version 1.14.1) [20], with size factors calculated using the pooling approach described in [21]. A pseudocount of one was applied before the log2 transformation to avoid taking the log of zero (i.e., log2(cpm + 1)). The normalized counts were then aggregated using either the mean or the median (Fig. 1). We also considered alternative single-cell normalization techniques: bayNorm (version 1.8.0) [28] for both Smart-Seq2 and 10X, and sctransform (version 0.3.2) [29] for 10X only. We observed minimal differences in normalized expression values (as assessed by Pearson’s correlation at the gene level over the donor-run combinations; Additional file 1: Table S4) and in eQTL results after dr-mean aggregation (as assessed by overlapping eQTL effect sizes and p values). Sum aggregation was performed directly on the raw counts (Fig. 1), followed by pseudo-bulk-like TMM normalization [1] and log2 transformation, as implemented in edgeR (version 3.28.1) [30].
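The difference between the three aggregation routes can be illustrated with a minimal Python sketch. This is not the scran or edgeR implementation: it assumes a plain library-size CPM in place of pooled size factors, it omits TMM normalization entirely, and the function names (`log2_cpm`, `aggregate_normalized`, `aggregate_sum`) are hypothetical names chosen for illustration.

```python
import math
from statistics import mean, median


def log2_cpm(counts, pseudocount=1.0):
    """Return log2(cpm + pseudocount) for one cell.

    Assumption: CPM is computed from the cell's total count, which is a
    simplification of the pooled size factors used in the text.
    """
    library_size = sum(counts)
    if library_size == 0:
        raise ValueError("cell has zero total counts")
    return [math.log2(c / library_size * 1e6 + pseudocount) for c in counts]


def aggregate_normalized(cells, how="mean"):
    """Gene-wise mean or median of per-cell log2(cpm + 1) values."""
    reducer = {"mean": mean, "median": median}[how]
    normalized = [log2_cpm(cell) for cell in cells]
    # zip(*normalized) iterates over genes instead of cells
    return [reducer(gene_values) for gene_values in zip(*normalized)]


def aggregate_sum(cells):
    """Gene-wise sum of raw counts; normalization would follow this step."""
    return [sum(gene_values) for gene_values in zip(*cells)]
```

With three toy cells in which a gene is detected in only one cell, the median of the normalized values is zero while the mean is not, which is the behavior one would expect for sparsely detected genes.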
In all cases, aggregation was performed at two levels of batch (Tables 1 and 2). First, we aggregated all cells from each donor (i.e., d-mean, d-median, d-sum). In this setting, one sample corresponds to one donor (n = 87 in the iPSC Smart-Seq2 dataset, n = 174 in the FPP 10X data; samples with > 5 cells only). Note that all donors in the iPSC Smart-Seq2 dataset had > 5 cells, so for this dataset the threshold only had an effect when aggregating by donor and sequencing run. In the 10X data, only five donors were removed by this filter (Additional file 1: Fig. S2c,d). In cases with low sequencing coverage, or when poor sequencing quality is a concern, a higher minimum cell threshold may be needed, with the trade-off being fewer donors and thus less power for eQTL discovery [19]. Next, we aggregated separately across donors and sequencing runs (dr-mean, dr-median, dr-sum). In this second setting, one sample is a unique donor-sequencing run combination (when considering samples with > 5 cells, n = 155 in the iPSC Smart-Seq2 data, n = 702 in the FPP 10X data). Visually, the various aggregation methods show a similar picture across donors/samples and genes, with the median aggregations being most affected by the zero-inflated expression (as shown on the iPSC Smart-Seq2 data in Additional file 1: Figs. S14 and S15).
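The two aggregation levels and the minimum-cell filter can be sketched as follows. This is an illustrative Python sketch only: the input layout of `(cell_id, donor, run)` tuples and the function name `group_cells` are assumptions, not part of the original pipeline.

```python
from collections import defaultdict


def group_cells(cells, level="donor", min_cells=5):
    """Group cell IDs per donor ("donor") or per donor-run pair ("donor_run").

    cells: iterable of (cell_id, donor, run) tuples (hypothetical layout).
    Only groups with MORE than min_cells cells are kept, mirroring the
    "> 5 cells" criterion described in the text.
    """
    if level not in ("donor", "donor_run"):
        raise ValueError("level must be 'donor' or 'donor_run'")
    groups = defaultdict(list)
    for cell_id, donor, run in cells:
        key = donor if level == "donor" else (donor, run)
        groups[key].append(cell_id)
    return {key: ids for key, ids in groups.items() if len(ids) > min_cells}
```

Under this sketch, a donor with ten cells split four and six across two runs passes the donor-level filter as one sample, but contributes only the six-cell run at the donor-run level, which shows why the threshold can matter at one level and not the other.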