The model specification in NEBULA

LH Liang He
JD Jose Davila-Velderrain
TS Tomokazu S. Sumida
DH David A. Hafler
MK Manolis Kellis
AK Alexander M. Kulminski
ask Ask a question
Favorite

Consider a raw count matrix of n cells from m individuals in a single-cell data set. We choose to use an NBM to take into account the uncertainty originating from the Poisson sampling process, biological effects, and technical noise. Denote by yij the raw count of a gene and by xij=(xij0,,xijk)R1×(k+1) a row vector of k fixed-effects predictors and the intercept (xij0=1) of cell j (j{1,,ni} and ini=n) in individual i (i{1,,m}). We model the raw count using the following NBMM

where NB is a negative binomial distribution parametrized by its mean μ and cell-level dispersion parameter ϕ such that the variance is μ+μ2/φ, βR(k+1)×1 is a vector of the coefficients of the predictors, ωi are subject-level random effects capturing the between-subject overdispersion, and πij is a known cell-specific scaling factor reflecting the sequencing depth, which can be the library size or some global normalizing factor (e.g., ref. 52). Note that πij can also be gene-specific as recent studies show that a single normalizing factor might be insufficent to normalize scRNA-seq data53. The negative binomial log-linear mixed model (NBLMM) proposed in ref. 54 assumes a log-normal distribution for the random effects ωi, i.e.,

where σ2 is the variance component of the between-subject random effects.

The estimation of the NBLMM poses a computational challenge because the marginal likelihood is intractable after integrating out the random effects ωi. Booth et al.54 originally propose an estimation method based on an Expectation-Maximization (EM) algorithm. As the NBLMM falls in the framework of the GLMM, conventional estimation methods for the GLMM can also be applied, including penalized likelihood-based methods12,13, Laplace approximation16, AGQ24,25. These methods, implemented in the lme4 R package18 through the argument nAGQ, differ in how to approximate the high-dimensional integral in the marginal likelihood, among which the penalized likelihood-based method is the fastest but the least accurate while AGQ is more accurate and much slower. We refer to ref. 55 for more extensive reviews of these methods. More recently, Zhang et al.17 propose an estimation procedure for the NBLMM by transforming the problem into repeatedly fitting a linear mixed model using iteratively reweighted least square. All these methods involve at least a two-layer optimization procedure. Although the time complexity is O(n) under the assumption of the independent random effects, the two-layer procedure often requires hundreds of iterations to converge, which becomes the major computational bottleneck. On the other hand, Bayesian methods such as MCMC26 offer better accuracy, but are even slower than the above methods.

The rationale behind NEBULA is to skip the time-consuming two-layer optimization algorithm for as many genes as possible and to substantially reduce the number of iterations if such an algorithm has to be applied. The key idea is to derive an approximate marginal likelihood in a closed-form so that the model can be estimated using a simple Newton-Raphson (NR) or quasi-Newton method. Inspired by ref. 28, instead of the log-normal distribution in eq. (2), we assume that the between-subject random effects follow the following gamma distribution

where α and λ are the shape and rate parameters, respectively. One advantage of this parametrization is that it matches the first two moments to the NBLMM, so that even in a situation where eq. (2) is the true distribution of the random effects, Eq. (3) can still provide an accurate estimate of σ2 when it is not large28. We call the NBMM with the random effects defined in Eq. (3) as a negative binomial gamma mixed model (NBGMM). Under the NBGMM, the marginal mean and variance of yij are given by

from which we see that the first overdispersion term in Eq. (4) is controlled by the between-subject variance component σ2 alone, and the other includes both σ2 and ϕ (The derivation can be found in Supplementary Note A.1). Previous studies20,56,57 indicate that it has practically little influence on the estimate of fixed-effects predictors, particularly at the cellular level, to assume a log-normal, gamma, or even misspecified distributions for the random effects. Nevertheless, misspecified random-effects distribution has a larger impact on the estimate of subject-level predictors if the number of subjects is small. Our real data analysis (Fig. 5c and Supplementary Fig. S14) showed concordant results with these studies. In general, the marginal likelihood of the NBGMM after integrating out ωi cannot be obtained analytically. However, we show in the next section how to achieve an approximated closed form of the marginal likelihood when ni is large. We first focus on the derivation in an asymptotic situation, and then investigate its practical performance in a finite sample.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

post Post a Question
0 Q&A