Selective sampling–based clustering ensemble

Van Hoan Do; Francisca Rojas Ringeling; Stefan Canzar

Improve Research Reproducibility A Bio-protocol resource

Home
Protocols

Concise Method

Selective sampling–based clustering ensemble

VD Van Hoan Do

FR Francisca Rojas Ringeling

SC Stefan Canzar

This method is extracted from research article: Genome Res, Apr 2021

Linear-time cluster ensembles of large-scale single-cell RNA-seq and multimodal data

DOI: 10.1101/gr.267906.120

Request a Protocol

Ask a question

Favorite

With a running time that scales quadratically with the number of cells, the application of Algorithm 1 to large-scale scRNA-seq data sets becomes infeasible. We therefore apply step 3 of our clustering ensemble approach (Algorithm 1) to a carefully selected sketch of the data. However, the co-association matrix H built in step 3 of the algorithm is based on cluster labels that were learned from the full data in step 2 using our tailored LSC algorithm. In addition, we propose a simple sampling technique that uses all clusterings computed in step 2 to guide the selection of cells.

Sampling cells uniformly at random is naturally fast, because the decision to include a given cell into a sketch does not depend on any other cell. At the same time, these independent decisions ignore the global structure of the data such as the abundance of different cell types and may thus lead to a loss of rare cell types (Hie et al. 2019b). We therefore propose a sampling approach that uses the clusterings of the data computed in step 2 of Algorithm 1 to inform the (fast) selection of cells. More specifically, let Π = {π₁, π₂, …, π_m}, where π_i = (π_i1, π_i2, …, π_ik) is the ith clustering returned in step 2 of Algorithm 1, i = 1, 2, …, m. We select a sketch S of size $[k \sqrt{n}]$ that contains roughly the same number of cells in each cluster π_ij, for all i and j. This selective sampling procedure iterates through all clusters contained in all clusterings from which it randomly picks a cell not already contained in the sketch, until the size of the sketch reaches $[k \sqrt{n}]$ (see Algorithm 2).

1. Input: Component clusterings Π = {π₁, π₂, …, π_m}, number of clusters k.

2. Initialization: $S = \emptyset$ .

3. while $| S | < k \sqrt{n}$ do

4 for i = 1 to m do

5. for j = 1 to k do

6. Randomly select a cell s from $π_{i j} ∖ S$

7. $S = S \cup {s}$

8. end

9. end

10. end

Given a selectively sampled sketch S, we apply steps 3 and 4 in Algorithm 1 to cells in S, using labels obtained from the full data in step 2. That is, we construct a co-association matrix whose entries count the number of times the two corresponding cells in S were placed in the same cluster by a run of the LSC algorithm in step 2. From this matrix, we compute a consensus clustering of S using hierarchical clustering and finally transfer cluster labels to the remaining cells using supervised k-nearest-neighbors classification. That is, we assign each cell not in S to the cluster that the majority of its k nearest neighbors were placed in by the preceding consensus clustering of S. Our selective sampling-based cluster ensemble approach is summarized in Algorithm 3.

1. Input: Cells x₁, …, x_n; number of clusters k.

2. Run the tailored LSC algorithm for a varying number of landmarks and different kernel bandwidths. Let Π = {π₁, π₂, …, π_m} be the set of m clusterings.

3. Run selective sampling (Algorithm 2) on Π to obtain a sketch S of size $| S | = [k \sqrt{n}]$ .

4. Summarize all clusterings of cells in S computed in step 2 in a co-association matrix H^S.

5. Apply single linkage hierarchical clustering to H^S to obtain k clusters for S.

6. Transfer labels to full data using k-nearest-neighbors classification.

Landmark-based spectral clustering performed in step 2 of Algorithm 3 takes O(n) (see above). Because we selectively sample a sketch of size $| S | = O (\sqrt{n})$ in step 3, the complexity of steps 4 and 5 now reduces to O(n). Together with the k-NN classification that runs in O(n) in step 6, our selective sampling-based cluster ensemble scheme scales linearly with the number of cells n.

This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

Post a Question

0 Q&A

Share your protocol with your peers.

Submit a Preprint Protocol