2.1.2.  Regression mixture model

Hans-Ulrich Klein


This method is extracted from research article: bioRxiv, Jan 2023

demuxmix: Demultiplexing oligonucleotide-barcoded single-cell RNA sequencing data with regression mixture models

DOI: 10.1101/2023.01.27.525961


For each HTO $j$, a negative binomial regression mixture model with two components is used to model the distribution of $y_{i,j}$ (Wedel and DeSarbo, 2002). The components represent negative droplets (not containing cells tagged by HTO $j$) and positive droplets (containing cells tagged by HTO $j$). Let $h$ denote the probability mass function of the negative binomial distribution. The mixture distribution for the $j$-th HTO can be written as

$$f(y_{i,j} \mid x_i, \phi_j) = \pi_{j,1}\, h(y_{i,j} \mid \mu_{i,j,1}, v_{j,1}) + \pi_{j,2}\, h(y_{i,j} \mid \mu_{i,j,2}, v_{j,2}), \tag{1}$$

where the vector $\phi_j = (\pi_{j,1}, \theta_{j,1}, \pi_{j,2}, \theta_{j,2})$ denotes all model parameters. The mixing proportions $\pi_{j,1}, \pi_{j,2} > 0$, with $\pi_{j,1} + \pi_{j,2} = 1$, describe the unknown proportions of negative and positive droplets. In a regression mixture model, the means of the observations in each component are predicted from explanatory variables. Here, the number of detected genes in droplet $i$, denoted $x_i$, is used to predict the HTO counts using a negative binomial regression model with canonical link function $g$:

$$g(\mu_{i,j,k}) = \alpha_{j,k} + \beta_{j,k}\, x_i, \qquad k = 1, 2. \tag{2}$$

The vector $\theta_{j,k} = (\alpha_{j,k}, \beta_{j,k}, v_{j,k})$ contains the regression coefficients and the dispersion parameter of the $k$-th component.
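As an illustration, the mixture distribution and the component means can be evaluated numerically as in the following minimal Python sketch. Several choices here are assumptions that are not stated in the text: a log link for $g$, a negative binomial parameterized by its mean $\mu$ and a size parameter $v$ with variance $\mu + \mu^2/v$, the function names, and all parameter values, which are made up for demonstration.

```python
import math


def nb_logpmf(y, mu, size):
    """Log-pmf of a negative binomial with mean mu and variance mu + mu**2/size.

    The mean/size parameterization is an assumption of this sketch.
    """
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))


def component_mean(x, alpha, beta):
    """Mean HTO count predicted from the number of detected genes x.

    Assumes a log link, i.e. log(mu) = alpha + beta * x.
    """
    return math.exp(alpha + beta * x)


def mixture_pmf(y, x, pi, theta):
    """Two-component mixture probability of observing HTO count y.

    pi    = (pi_1, pi_2), the proportions of negative and positive droplets
    theta = ((alpha_1, beta_1, v_1), (alpha_2, beta_2, v_2))
    """
    total = 0.0
    for pi_k, (alpha, beta, size) in zip(pi, theta):
        mu = component_mean(x, alpha, beta)
        total += pi_k * math.exp(nb_logpmf(y, mu, size))
    return total


# Hypothetical parameters: 70% negative droplets with low counts,
# 30% positive droplets with high counts.
pi = (0.7, 0.3)
theta = ((1.0, 0.0005, 5.0), (4.0, 0.0003, 10.0))
print(mixture_pmf(y=12, x=2000, pi=pi, theta=theta))
```

Because the mean of each component depends on $x$, two droplets with the same HTO count but different numbers of detected genes receive different mixture probabilities.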

The model parameters $\phi_j$ are estimated by the expectation-maximization (EM) algorithm, which iteratively estimates the regression parameters in equation (2) and then updates the class memberships of the droplets. The EM algorithm is initialized using the clustering results from the preprocessing step: droplets from the cluster with the larger mean HTO count are assigned to component $k=2$. Thus, component $k=2$ represents positive droplets. After the parameters have been estimated, the posterior probability that droplet $i$ contains a cell from the sample tagged with HTO $j$, i.e., that its class $c_{i,j}$ equals 1, can be calculated:

$$P(c_{i,j} = 1 \mid y_{i,j}, x_i, \hat{\phi}_j) = \frac{\hat{\pi}_{j,2}\, h(y_{i,j} \mid \hat{\mu}_{i,j,2}, \hat{v}_{j,2})}{\sum_{k=1}^{2} \hat{\pi}_{j,k}\, h(y_{i,j} \mid \hat{\mu}_{i,j,k}, \hat{v}_{j,k})}. \tag{3}$$

If droplet $i$ is more likely to contain a cell from sample $j$ than not, i.e., if this posterior probability exceeds 0.5, the estimated class $\hat{c}_{i,j}$ is set to 1; otherwise, $\hat{c}_{i,j} := 0$.
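The estimation and classification steps can be sketched in Python for the simplest case. This is not the demuxmix implementation: to stay short, the sketch drops the regression on the number of detected genes (each component has a single mean) and keeps the dispersion fixed rather than estimating it. The mean/size parameterization of the negative binomial, the function names, and the example counts are likewise assumptions made for illustration.

```python
import math


def nb_logpmf(y, mu, size):
    """Log-pmf of a negative binomial with mean mu and variance mu + mu**2/size."""
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))


def posterior_positive(y, pi, mu, size):
    """Posterior probability of the positive component via Bayes' rule.

    Computed in log space so that very small likelihoods do not underflow.
    """
    log_w = [math.log(pi[k]) + nb_logpmf(y, mu[k], size[k]) for k in (0, 1)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    return w[1] / (w[0] + w[1])


def em_naive_mixture(counts, init_labels, size=(5.0, 5.0), n_iter=50):
    """EM for a two-component mixture without regression, dispersion held fixed.

    init_labels holds the initial hard clustering (0 = negative, 1 = positive).
    """
    n = len(counts)
    post = [float(c) for c in init_labels]
    if not 0 < sum(post) < n:
        raise ValueError("both initial clusters must be non-empty")
    for _ in range(n_iter):
        # M-step: with a fixed size parameter, the weighted maximum-likelihood
        # estimate of each component mean is the weighted average count.
        n_pos = min(max(sum(post), 1e-9), n - 1e-9)
        pi = (1.0 - n_pos / n, n_pos / n)
        mu_neg = sum((1.0 - p) * y for p, y in zip(post, counts)) / (n - n_pos)
        mu_pos = sum(p * y for p, y in zip(post, counts)) / n_pos
        mu = (max(mu_neg, 1e-8), max(mu_pos, 1e-8))
        # E-step: update the soft class memberships of all droplets.
        post = [posterior_positive(y, pi, mu, size) for y in counts]
    # A droplet is called positive if it is more likely positive than not.
    labels = [1 if p > 0.5 else 0 for p in post]
    return pi, mu, post, labels


# Made-up HTO counts: six background droplets and four tagged droplets.
counts = [3, 5, 2, 8, 4, 6, 250, 310, 190, 275]
init_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
pi, mu, post, labels = em_naive_mixture(counts, init_labels)
print(pi, mu, labels)
```

In the full regression mixture model, the M-step would instead fit a weighted negative binomial regression per component, using the current posterior probabilities as weights.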

As shown in the Results section, many datasets demonstrate a positive association between $x$ and $y_{\cdot,j}$, which can be leveraged to improve the demultiplexing results. However, the association can be absent in the negative component if the HTO counts are very low. Moreover, if different cell types with different RNA contents and cell surface properties are pooled, the regression can be driven by the differences between these heterogeneous cell clusters rather than by the association between $x$ and $y_{\cdot,j}$ within each cluster. Therefore, demuxmix fits two simpler mixture models in addition to model (1) and selects the best model for each HTO. The first model does not include a regression model for the negative component $k=1$, i.e., the mean $\mu_{i,j,1}$ in equation (1) becomes $\mu_{j,1}$ and does not depend on the number of detected genes. The second model removes the regression models from both components and is referred to as the naïve mixture model. For each HTO, the model that minimizes the expected number of classification errors, calculated using the posterior probabilities from equation (3), is selected and used for the subsequent droplet classification.
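The model selection step can be illustrated with a short Python sketch. The exact error formula is not reproduced in the text, so the sketch makes an assumption: since each droplet is assigned to its more probable class, the probability of misclassifying it under the fitted model is the smaller of its two posterior probabilities, and the expected number of errors is the sum of these values over all droplets. The model labels and the posterior probabilities below are made up for demonstration.

```python
def expected_errors(posteriors):
    """Expected number of misclassified droplets under a fitted model.

    posteriors holds, for each droplet, the posterior probability of being
    positive. Assumes each droplet is assigned to its more probable class.
    """
    return sum(min(p, 1.0 - p) for p in posteriors)


def select_model(posteriors_by_model):
    """Return the name of the model with the smallest expected error."""
    return min(posteriors_by_model,
               key=lambda name: expected_errors(posteriors_by_model[name]))


# Hypothetical posterior probabilities from the three candidate models
# for the same five droplets and one HTO.
fits = {
    "regression in both components": [0.01, 0.02, 0.97, 0.99, 0.40],
    "regression in positive component only": [0.01, 0.01, 0.99, 0.99, 0.10],
    "naive mixture": [0.05, 0.10, 0.90, 0.95, 0.45],
}
best = select_model(fits)
print(best, expected_errors(fits[best]))
```

Under this criterion, a model is preferred when it separates the two components more confidently, that is, when few droplets have posterior probabilities close to 0.5.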

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.
