At this point we have obtained a partition of the collection of cytometries. Next, we want to obtain a prototype cytometry for every group i of cytometries in the partition (lines 18–21 in Algorithm 1). To this end, we resort to k-barycenters using the Wasserstein distance, which provide a suitable tool for consensus on probability distributions (see [20]). We propose three different methods to obtain a template cytometry from a group of cytometries, that is, to obtain a consensus (ensemble) clustering on flow cytometries. These methods are given in Algorithms 2, 3 and 4.
The intention behind pooling (Algorithm 2) is to take advantage of having groups of similar cytometries and knowing the actual cell types in them. A prototype of a cell type is obtained through a (1-)barycenter, that is, a consensus representation, of the multivariate distributions that represent the same cell type in the cytometries that are members of the same group in the partition. A prototype cytometry is the collection of prototypes of each cell type. This can be seen in Fig. 2. On the left-hand side, we have 5 different cytometries, each with 4 different cell types. Since the cell types are known, we take all the black ellipsoids of the left plot, representing the different normal distributions, and obtain the black ellipsoid on the right plot, the barycenter of the group of normal distributions, as a consensus element for monocytes. Doing this for every cell type gives us the prototype cytometry represented on the right of Fig. 2.
Fig. 2. An application of Algorithm 2 (pooling). On the left we have 5 different cytometries, each with 4 different identified cell types. On the right we have a prototype cytometry obtained by taking the 1-barycenter for each cell type. Ellipsoids contain 95% of the probability of the respective multivariate Gaussian distributions.
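The pooling step can be illustrated with a minimal sketch, assuming each cell type in each cytometry is summarized by a mean vector and a covariance matrix. The 2-Wasserstein barycenter of Gaussians is again Gaussian: its mean is the weighted average of the means, and its covariance can be approximated by a standard fixed-point iteration. The function names (`psd_sqrt`, `gaussian_barycenter`, `pool_prototype`) and the input layout are hypothetical, and whether Algorithm 2 computes the barycenter exactly this way is an assumption.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semidefinite matrix."""
    sym = (mat + mat.T) / 2.0  # guard against small numerical asymmetry
    vals, vecs = np.linalg.eigh(sym)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def gaussian_barycenter(means, covs, weights=None, n_iter=100, tol=1e-10):
    """2-Wasserstein barycenter of Gaussians via fixed-point iteration.

    Assumes every covariance matrix is positive definite.
    """
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    n = len(means)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)

    mean_bar = w @ means                        # weighted average of the means
    cov_bar = np.einsum("i,ijk->jk", w, covs)   # starting point for the iteration
    for _ in range(n_iter):
        root = psd_sqrt(cov_bar)
        inv_root = np.linalg.inv(root)
        inner = sum(wi * psd_sqrt(root @ c @ root) for wi, c in zip(w, covs))
        new_cov = inv_root @ inner @ inner @ inv_root
        converged = np.linalg.norm(new_cov - cov_bar) < tol
        cov_bar = new_cov
        if converged:
            break
    return mean_bar, cov_bar


def pool_prototype(group):
    """Hypothetical pooling helper.

    `group` maps a cell-type label to the list of (mean, cov) pairs that
    represent that cell type across the cytometries of one group.
    Returns one (mean, cov) prototype per cell type.
    """
    return {
        cell_type: gaussian_barycenter([m for m, _ in comps], [c for _, c in comps])
        for cell_type, comps in group.items()
    }
```

Calling `pool_prototype` on a dictionary with one entry per known cell type returns one consensus Gaussian per cell type, which together play the role of the prototype cytometry.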
However, templates can also be obtained when we have gated cytometries without identified cell types. This could be the case when unsupervised gating is used to obtain a database of gated cytometries. Density-based hierarchical clustering (Algorithm 3) and k-barycenter (Algorithm 4) are based on the idea that clusters that are close in Wasserstein distance should be understood as representing the same cell type, although we may not know which one. When using k-barycenters we must specify the number of cell types, K, that we want for the artificial cytometry. In contrast, when using density-based hierarchical clustering such as HDBSCAN or DBSCAN, the selection of the number of cell types for the prototype cytometry is automatic. Recall that both k-barycenters, through trimming, and density-based hierarchical clustering are robust clustering procedures.
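The notion of clusters being "close in Wasserstein distance" can be made concrete with a short sketch, assuming each cluster is summarized as a Gaussian (mean, covariance) pair and that the 2-Wasserstein distance is used, which has a closed form for Gaussians. The function names `w2_gaussian` and `pairwise_w2` are hypothetical, and the exact distance and clustering settings of Algorithm 3 are not specified here.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semidefinite matrix."""
    sym = (mat + mat.T) / 2.0
    vals, vecs = np.linalg.eigh(sym)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def w2_gaussian(m1, c1, m2, c2):
    """Closed-form 2-Wasserstein distance between N(m1, c1) and N(m2, c2)."""
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    c1, c2 = np.asarray(c1, dtype=float), np.asarray(c2, dtype=float)
    root = psd_sqrt(c1)
    cross = psd_sqrt(root @ c2 @ root)
    squared = np.sum((m1 - m2) ** 2) + np.trace(c1 + c2 - 2.0 * cross)
    return float(np.sqrt(max(squared, 0.0)))  # clip tiny negative round-off


def pairwise_w2(clusters):
    """Symmetric matrix of distances between a list of (mean, cov) clusters."""
    n = len(clusters)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = w2_gaussian(*clusters[i], *clusters[j])
    return dist
```

A matrix produced by `pairwise_w2` over all clusters of all cytometries in a group could then be passed to a density-based method that accepts precomputed distances, for example scikit-learn's `DBSCAN` with `metric="precomputed"`; each resulting group of clusters would be summarized by its barycenter.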
Figs. 3 and 4 illustrate how Algorithms 3 and 4 work. Since we do not have cell type information for the 5 gated cytometries, we obtain the plot shown on the left of Figs. 3 and 4. However, the absence of this information can be mitigated using the spatial information, which clearly shows a group structure among the ellipsoids. We use density-based hierarchical clustering and k-barycenters, respectively, to try to capture this spatial information. As a result, we obtain the template cytometries on the right side of Figs. 3 and 4. The templates represent well the real cell types behind the cytometries (compare with Fig. 2), although we still do not know the cell type corresponding to each ellipsoid. This could be achieved using expert information or by matching populations.
Fig. 3. Application of Algorithm 3 (density-based). On the left we have the same 5 cytometries as in Fig. 2, but each cytometry is grouped in clusters without cell types being identified. On the right we have a prototype cytometry obtained by applying the density-based hierarchical clustering approach to the cytometries represented on the left. Ellipsoids contain 95% of the probability of the respective multivariate Gaussian distributions.
Fig. 4. Application of Algorithm 4 (4-barycenter). On the left we have the same 5 cytometries as in Fig. 2, but each cytometry is grouped in clusters without cell types being identified. On the right we have a prototype cytometry obtained by taking the 4-barycenter of the cytometries represented on the left. Ellipsoids contain 95% of the probability of the respective multivariate Gaussian distributions.
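The k-barycenter idea behind Algorithm 4 can be sketched as a Lloyd-type alternation: assign every Gaussian cluster to its nearest prototype in 2-Wasserstein distance, then replace each prototype by the barycenter of the clusters assigned to it. This is a simplified illustration under stated assumptions: it omits the trimming step that makes the procedure robust, it uses a deterministic farthest-point initialization chosen only for reproducibility, and all function names are hypothetical.

```python
import numpy as np


def psd_sqrt(mat):
    """Symmetric square root of a positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh((mat + mat.T) / 2.0)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T


def w2_squared(m1, c1, m2, c2):
    """Squared 2-Wasserstein distance between two Gaussians."""
    root = psd_sqrt(c1)
    cross = psd_sqrt(root @ c2 @ root)
    return float(np.sum((m1 - m2) ** 2) + np.trace(c1 + c2 - 2.0 * cross))


def barycenter(means, covs, n_iter=50):
    """Equal-weight Gaussian barycenter via fixed-point iteration."""
    cov_bar = covs.mean(axis=0)
    for _ in range(n_iter):
        root = psd_sqrt(cov_bar)
        inv_root = np.linalg.inv(root)
        inner = sum(psd_sqrt(root @ c @ root) for c in covs) / len(covs)
        cov_bar = inv_root @ inner @ inner @ inv_root
    return means.mean(axis=0), cov_bar


def k_barycenters(means, covs, k, n_iter=20):
    """Untrimmed k-barycenter clustering of Gaussians (illustrative only)."""
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    n = len(means)

    # Farthest-point initialization: start at index 0, then repeatedly add
    # the Gaussian farthest from the prototypes chosen so far.
    chosen = [0]
    while len(chosen) < k:
        gaps = [min(w2_squared(means[i], covs[i], means[c], covs[c]) for c in chosen)
                for i in range(n)]
        chosen.append(int(np.argmax(gaps)))
    protos = [(means[c], covs[c]) for c in chosen]

    labels = None
    for _ in range(n_iter):
        dist = np.array([[w2_squared(means[i], covs[i], pm, pc) for pm, pc in protos]
                         for i in range(n)])
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
        # Keep the previous prototype if a group happens to become empty.
        protos = [barycenter(means[labels == j], covs[labels == j])
                  if np.any(labels == j) else protos[j]
                  for j in range(k)]
    return labels, protos
```

Here `k` plays the role of the number of cell types K that must be fixed in advance, and the returned prototypes form the template cytometry.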