For each HTO j, a negative binomial regression mixture model with two components is used to model the distribution of yi,j (Wedel and DeSarbo, 2002). The components represent negative droplets (not containing cells tagged by HTO j) and positive droplets (containing cells tagged by HTO j). Let h(· | μ, v) denote the probability mass function of the negative binomial distribution with mean μ and dispersion parameter v. The mixture distribution for the j-th HTO can be written as

$$ f(y_{i,j} \mid x_i, \phi_j) = \sum_{k=1}^{2} \pi_{j,k}\, h(y_{i,j} \mid \mu_{i,j,k}, v_{j,k}) \tag{1} $$
where the vector ϕj = (πj,1, θj,1, πj,2, θj,2) denotes all model parameters. The mixing proportions πj,1, πj,2 > 0 with πj,1 + πj,2 = 1 describe the unknown proportions of negative and positive droplets. In a regression mixture model, the mean of the observations in each component is predicted from explanatory variables. Here, the number of detected genes in the droplet is used to predict the HTO counts using a negative binomial regression model with canonical link function g (equation (2)).
The vector θj,k = (αj,k, βj,k, vj,k) contains the regression coefficients and the dispersion parameter of the k-th component.
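The regression step can be written generically as follows. This is a sketch consistent with the parameters named above, not necessarily the exact published equation: x_i denotes the number of detected genes in droplet i, and t(·) is a placeholder (an assumption) for whatever transformation, if any, is applied to x_i before it enters the linear predictor.

```latex
% Generic form of the component-wise regression (assumed notation):
% intercept \alpha_{j,k}, slope \beta_{j,k}, link function g.
g(\mu_{i,j,k}) = \alpha_{j,k} + \beta_{j,k}\, t(x_i), \qquad k \in \{1, 2\}
```

Under this form, a positive slope βj,k means that droplets with more detected genes are expected to carry more HTO counts in component k.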
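The model parameters ϕj are estimated by the expectation-maximization (EM) algorithm, which iteratively estimates the regression parameters in equation (2) and then updates the class memberships of the droplets. The EM algorithm is initialized using the clustering results from the preprocessing step: droplets from the cluster with the larger mean HTO count are assigned to component k=2. Thus, component k=2 represents positive droplets. After the parameters have been estimated, the posterior probability that droplet i contains a cell from the sample tagged with HTO j can be calculated:

$$ P(k = 2 \mid y_{i,j}, x_i, \hat{\phi}_j) = \frac{\hat{\pi}_{j,2}\, h(y_{i,j} \mid \hat{\mu}_{i,j,2}, \hat{v}_{j,2})}{\sum_{k'=1}^{2} \hat{\pi}_{j,k'}\, h(y_{i,j} \mid \hat{\mu}_{i,j,k'}, \hat{v}_{j,k'})} \tag{3} $$

where hats denote the EM estimates. If droplet i is more likely to contain a cell from sample j than not, that is, if this posterior probability exceeds 0.5, the estimated class is set to 1, marking the droplet as positive for HTO j; otherwise, the droplet is classified as negative.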
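The EM procedure and the posterior-based classification can be illustrated with a minimal Python sketch. To stay short, it makes several simplifying assumptions that are not part of the method described above: it fits a mixture without regression (each component has a constant mean), it replaces the maximum-likelihood update of the dispersion with a moment-based estimate, and it initializes from a hypothetical fixed threshold instead of the preprocessing clustering. All function names and the example counts are made up for illustration. List index 0 stands for the negative component (k=1) and index 1 for the positive component (k=2).

```python
import math

def nb_logpmf(y, mu, size):
    """Log pmf of a negative binomial with mean mu and variance mu + mu**2 / size."""
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))

def e_step(counts, pi, mu, size):
    """Posterior probability of the positive component (index 1) for each count."""
    post = []
    for y in counts:
        log_neg = math.log(pi[0]) + nb_logpmf(y, mu[0], size[0])
        log_pos = math.log(pi[1]) + nb_logpmf(y, mu[1], size[1])
        top = max(log_neg, log_pos)  # subtract the maximum for numerical stability
        w_neg, w_pos = math.exp(log_neg - top), math.exp(log_pos - top)
        # Clamp so that no component ever receives exactly zero weight.
        post.append(min(max(w_pos / (w_neg + w_pos), 1e-9), 1 - 1e-9))
    return post

def m_step(counts, post):
    """Update proportions, means and dispersions from the posterior weights."""
    pi, mu, size = [], [], []
    for weights in ([1 - p for p in post], post):
        total = sum(weights)
        mean = max(sum(w * y for w, y in zip(weights, counts)) / total, 1e-8)
        var = sum(w * (y - mean) ** 2 for w, y in zip(weights, counts)) / total
        pi.append(total / len(counts))
        mu.append(mean)
        # Moment estimate (assumption); near-Poisson fallback without overdispersion.
        size.append(mean ** 2 / (var - mean) if var > mean else 1e6)
    return pi, mu, size

def fit_mixture(counts, init_positive, n_iter=50):
    """Alternate M and E steps, starting from a preliminary positive/negative split."""
    post = [0.99 if flag else 0.01 for flag in init_positive]
    for _ in range(n_iter):
        pi, mu, size = m_step(counts, post)
        post = e_step(counts, pi, mu, size)
    return pi, mu, size, post

# Made-up HTO counts for one HTO; the threshold of 50 is a hypothetical
# stand-in for the clustering result of the preprocessing step.
counts = [3, 5, 8, 4, 6, 10, 7, 2, 250, 310, 180, 400, 275, 220]
init_positive = [y > 50 for y in counts]
pi, mu, size, post = fit_mixture(counts, init_positive)

# A droplet is called positive if the positive component is more likely than not.
is_positive = [p > 0.5 for p in post]
```

With these well-separated example counts, the low counts end up in the negative component and the high counts in the positive one. Real data with overlapping components would yield intermediate posterior probabilities for some droplets.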
As shown in the Results section, many datasets demonstrate a positive association between x and y∙,j, which can be leveraged to improve the demultiplexing results. However, the association can be absent in the negative component if the HTO counts are very low. Moreover, if different cell types with different RNA contents and cell surface properties are pooled, the regression can be driven by the differences between these heterogeneous cell clusters rather than by the association between x and y∙,j within each cluster. Therefore, demuxmix fits two simpler mixture models in addition to model (1) and selects the best model for each HTO. The first model does not include a regression model for the negative component k=1, i.e., the mean μi,j,1 in equation (1) becomes μj,1 and does not depend on the number of detected genes. The second model removes the regression models from both components and is referred to as the naïve mixture model. For each HTO, the model that minimizes the expected classification error, calculated using the posterior probabilities from equation (3), is selected and used for the subsequent droplet classification.
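The selection step can be sketched in a few lines of Python. The exact formula for the expected classification error is not spelled out here, so this example assumes one plausible definition: each droplet contributes min(p, 1 - p), the probability that a 0.5 threshold on its posterior probability p picks the wrong class under the fitted model. The model names and posterior values are hypothetical.

```python
def expected_errors(posteriors):
    """Expected number of misclassified droplets under a 0.5 threshold (assumed form)."""
    return sum(min(p, 1.0 - p) for p in posteriors)

def select_model(posteriors_by_model):
    """Return the name of the model with the smallest expected error."""
    return min(posteriors_by_model,
               key=lambda name: expected_errors(posteriors_by_model[name]))

# Hypothetical posterior probabilities of four droplets under three candidate models.
candidates = {
    "regression_both": [0.99, 0.02, 0.97, 0.01],
    "regression_positive_only": [0.90, 0.20, 0.80, 0.10],
    "naive": [0.95, 0.05, 0.60, 0.40],
}
best = select_model(candidates)
```

In this toy input, the first candidate assigns the most decisive posterior probabilities and is therefore selected.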