The model specification in NEBULA

Liang He; Jose Davila-Velderrain; Tomokazu S. Sumida; David A. Hafler; Manolis Kellis; Alexander M. Kulminski

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

The model specification in NEBULA

LH Liang He

JD Jose Davila-Velderrain

TS Tomokazu S. Sumida

DH David A. Hafler

MK Manolis Kellis

AK Alexander M. Kulminski

This method is extracted from research article: Commun Biol, May 2021

NEBULA is a fast negative binomial mixed model for differential or co-expression analysis of large-scale multi-subject single-cell data

DOI: 10.1038/s42003-021-02146-6

Ask a question

Favorite

Consider a raw count matrix of $n$ cells from $m$ individuals in a single-cell data set. We choose to use an NBM to take into account the uncertainty originating from the Poisson sampling process, biological effects, and technical noise. Denote by $y_{i j}$ the raw count of a gene and by $x_{i j} = (x_{i j 0}, \dots, x_{i j k}) \in R^{1 \times (k + 1)}$ a row vector of $k$ fixed-effects predictors and the intercept ( $x_{i j 0} = 1$ ) of cell $j$ ( $j \in {1, \dots, n_{i}}$ and $\sum_{i} n_{i} = n$ ) in individual $i$ ( $i \in {1, \dots, m}$ ). We model the raw count using the following NBMM

where $NB$ is a negative binomial distribution parametrized by its mean $μ$ and cell-level dispersion parameter $ϕ$ such that the variance is $μ + μ^{2} / φ$ , $β \in \in R^{(k + 1) \times 1}$ is a vector of the coefficients of the predictors, $ω_{i}$ are subject-level random effects capturing the between-subject overdispersion, and $π_{i j}$ is a known cell-specific scaling factor reflecting the sequencing depth, which can be the library size or some global normalizing factor (e.g., ref. ⁵²). Note that $π_{i j}$ can also be gene-specific as recent studies show that a single normalizing factor might be insufficent to normalize scRNA-seq data⁵³. The negative binomial log-linear mixed model (NBLMM) proposed in ref. ⁵⁴ assumes a log-normal distribution for the random effects $ω_{i}$ , i.e.,

where $σ^{2}$ is the variance component of the between-subject random effects.

The estimation of the NBLMM poses a computational challenge because the marginal likelihood is intractable after integrating out the random effects $ω_{i}$ . Booth et al.⁵⁴ originally propose an estimation method based on an Expectation-Maximization (EM) algorithm. As the NBLMM falls in the framework of the GLMM, conventional estimation methods for the GLMM can also be applied, including penalized likelihood-based methods^12,13, Laplace approximation¹⁶, AGQ^24,25. These methods, implemented in the lme4 R package¹⁸ through the argument nAGQ, differ in how to approximate the high-dimensional integral in the marginal likelihood, among which the penalized likelihood-based method is the fastest but the least accurate while AGQ is more accurate and much slower. We refer to ref. ⁵⁵ for more extensive reviews of these methods. More recently, Zhang et al.¹⁷ propose an estimation procedure for the NBLMM by transforming the problem into repeatedly fitting a linear mixed model using iteratively reweighted least square. All these methods involve at least a two-layer optimization procedure. Although the time complexity is $O (n)$ under the assumption of the independent random effects, the two-layer procedure often requires hundreds of iterations to converge, which becomes the major computational bottleneck. On the other hand, Bayesian methods such as MCMC²⁶ offer better accuracy, but are even slower than the above methods.

The rationale behind NEBULA is to skip the time-consuming two-layer optimization algorithm for as many genes as possible and to substantially reduce the number of iterations if such an algorithm has to be applied. The key idea is to derive an approximate marginal likelihood in a closed-form so that the model can be estimated using a simple Newton-Raphson (NR) or quasi-Newton method. Inspired by ref. ²⁸, instead of the log-normal distribution in eq. (2), we assume that the between-subject random effects follow the following gamma distribution

where $α$ and $λ$ are the shape and rate parameters, respectively. One advantage of this parametrization is that it matches the first two moments to the NBLMM, so that even in a situation where eq. (2) is the true distribution of the random effects, Eq. (3) can still provide an accurate estimate of $σ^{2}$ when it is not large²⁸. We call the NBMM with the random effects defined in Eq. (3) as a negative binomial gamma mixed model (NBGMM). Under the NBGMM, the marginal mean and variance of $y_{i j}$ are given by

from which we see that the first overdispersion term in Eq. (4) is controlled by the between-subject variance component $σ^{2}$ alone, and the other includes both $σ^{2}$ and $ϕ$ (The derivation can be found in Supplementary Note A.1). Previous studies^20,56,57 indicate that it has practically little influence on the estimate of fixed-effects predictors, particularly at the cellular level, to assume a log-normal, gamma, or even misspecified distributions for the random effects. Nevertheless, misspecified random-effects distribution has a larger impact on the estimate of subject-level predictors if the number of subjects is small. Our real data analysis (Fig. 5c and Supplementary Fig. S14) showed concordant results with these studies. In general, the marginal likelihood of the NBGMM after integrating out $ω_{i}$ cannot be obtained analytically. However, we show in the next section how to achieve an approximated closed form of the marginal likelihood when $n_{i}$ is large. We first focus on the derivation in an asymptotic situation, and then investigate its practical performance in a finite sample.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol