Aggregation and normalization methods

Anna S. E. Cuomo; Giordano Alvari; Christina B. Azodi; Davis J. McCarthy; Marc Jan Bonder


This method is extracted from research article: Genome Biol, Jun 2021

Optimizing expression quantitative trait locus mapping workflows for single-cell studies

DOI: 10.1186/s13059-021-02407-x


Raw single-cell counts were normalized with scran (version: 1.14.1) [20], using size factors calculated with the pooled approach described in [21], and then log2-transformed as normalized counts-per-million with a pseudocount of one to avoid taking the log of zero (i.e., log2(cpm + 1)). The normalized counts were then aggregated using either the mean or the median (Fig. 1). We also considered alternative single-cell normalization techniques: bayNorm (version: 1.8.0) [28] for both Smart-Seq2 and 10X, and sctransform (version: 0.3.2) [29] for 10X only. These produced minimal differences in normalized expression values (as assessed by Pearson’s correlation at the gene level over the donor-run combinations, Additional file 1: Table S4) and in eQTL results after dr-mean aggregation (as assessed by the overlap of eQTL effect sizes and p values). Sum aggregation was performed directly on the raw counts (Fig. 1), followed by pseudo-bulk-like TMM normalization [1] and log2 transformation, as implemented in edgeR (version: 3.28.1) [30].
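The normalize-then-aggregate order for the mean and median variants can be illustrated with a minimal Python sketch. This is not the authors' pipeline: it uses plain library-size CPM as a stand-in for scran's pooled size factors, and the function names `log2_cpm` and `aggregate` are hypothetical, chosen only for illustration.

```python
import math
from statistics import mean, median


def log2_cpm(counts):
    """Return log2(cpm + 1) for one cell's raw gene counts.

    Assumption: plain library-size scaling is used here in place of
    scran's pooled size factors, which are not reimplemented.
    """
    total = sum(counts)
    if total == 0:
        # A cell with no counts has cpm = 0 for every gene.
        return [0.0 for _ in counts]
    return [math.log2(c / total * 1e6 + 1) for c in counts]


def aggregate(cells, how="mean"):
    """Aggregate per-cell normalized vectors gene by gene.

    `cells` is a list of equal-length lists (one per cell);
    `how` is either "mean" or "median".
    """
    reducer = {"mean": mean, "median": median}[how]
    return [reducer(gene_values) for gene_values in zip(*cells)]


# Toy data: three cells from one donor, three genes each.
raw_cells = [[0, 10, 90], [0, 50, 50], [5, 0, 95]]
normalized = [log2_cpm(cell) for cell in raw_cells]
donor_mean = aggregate(normalized, how="mean")
donor_median = aggregate(normalized, how="median")
```

In this toy example the first gene is zero in two of the three cells, so its median is exactly zero while its mean is not, which is the kind of sensitivity to sparse expression that distinguishes the two summaries.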

In all cases, aggregation was performed at two levels of batch (Tables 1 and 2). First, we aggregated all cells from each donor (i.e., d-mean, d-median, d-sum). In this setting, one sample corresponds to one donor (n = 87 in the iPSC Smart-Seq2 dataset, n = 174 in the FPP 10X data; samples with > 5 cells only). Note that all donors from the iPSC Smart-Seq2 dataset had > 5 cells per donor, so this threshold only applied when aggregating over donor-run combinations. In the 10X data, only five donors were removed by this filter (Additional file 1: Fig. S2c,d). In cases with low sequencing coverage, or when poor sequencing quality is a concern, a higher minimum cell threshold may be needed, with the trade-off being fewer donors and thus less power for eQTL discovery [19]. Next, we aggregated separately across donors and sequencing runs (dr-mean, dr-median, dr-sum). In this second setting, one sample is a unique donor-sequencing run combination (when considering samples with > 5 cells, n = 155 in the iPSC Smart-Seq2 data, n = 702 in the FPP 10X data). Visually, the various aggregation methods show a similar picture across donors/samples and genes, with the median aggregations being most affected by the zero-inflated expression (as shown on the iPSC Smart-Seq2 data in Additional file 1: Fig. S14, Additional file 1: Fig. S15).
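The two grouping levels and the > 5 cell filter can be sketched as follows. This is a hedged illustration on toy data, not the authors' code: the helper names `group_cells` and `sum_aggregate`, the `(donor, run, counts)` tuple layout, and the level labels "d" and "dr" are assumptions made for the example, and the TMM normalization that follows sum aggregation in edgeR is not reproduced.

```python
from collections import defaultdict


def group_cells(cells, level="d", min_cells=5):
    """Group per-cell count vectors by donor ("d") or donor-run ("dr").

    `cells` is an iterable of (donor, run, counts) tuples. Only groups
    with strictly more than `min_cells` cells are kept, mirroring the
    "> 5 cells" rule described in the text.
    """
    groups = defaultdict(list)
    for donor, run, counts in cells:
        key = donor if level == "d" else (donor, run)
        groups[key].append(counts)
    return {k: v for k, v in groups.items() if len(v) > min_cells}


def sum_aggregate(vectors):
    """Sum raw counts gene by gene across the cells of one sample."""
    return [sum(gene_values) for gene_values in zip(*vectors)]


# Toy data: donor A has 4 cells in each of two runs, donor B has 6 cells in one run.
toy = (
    [("A", "r1", [1, 0])] * 4
    + [("A", "r2", [0, 2])] * 4
    + [("B", "r1", [1, 1])] * 6
)

d_sum = {k: sum_aggregate(v) for k, v in group_cells(toy, level="d").items()}
dr_sum = {k: sum_aggregate(v) for k, v in group_cells(toy, level="dr").items()}
```

Here donor A passes the filter at the donor level (8 cells) but neither of its runs passes at the donor-run level (4 cells each), which shows how the same threshold can remove samples only in the second setting.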

Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
