Selective sampling–based clustering ensemble

VD Van Hoan Do
FR Francisca Rojas Ringeling
SC Stefan Canzar
request Request a Protocol
ask Ask a question
Favorite

With a running time that scales quadratically with the number of cells, the application of Algorithm 1 to large-scale scRNA-seq data sets becomes infeasible. We therefore apply step 3 of our clustering ensemble approach (Algorithm 1) to a carefully selected sketch of the data. However, the co-association matrix H built in step 3 of the algorithm is based on cluster labels that were learned from the full data in step 2 using our tailored LSC algorithm. In addition, we propose a simple sampling technique that uses all clusterings computed in step 2 to guide the selection of cells.

Sampling cells uniformly at random is naturally fast, because the decision to include a given cell into a sketch does not depend on any other cell. At the same time, these independent decisions ignore the global structure of the data such as the abundance of different cell types and may thus lead to a loss of rare cell types (Hie et al. 2019b). We therefore propose a sampling approach that uses the clusterings of the data computed in step 2 of Algorithm 1 to inform the (fast) selection of cells. More specifically, let Π = {π1, π2, …, πm}, where πi = (πi1, πi2, …, πik) is the ith clustering returned in step 2 of Algorithm 1, i = 1, 2, …, m. We select a sketch S of size [kn] that contains roughly the same number of cells in each cluster πij, for all i and j. This selective sampling procedure iterates through all clusters contained in all clusterings from which it randomly picks a cell not already contained in the sketch, until the size of the sketch reaches [kn] (see Algorithm 2).

 1. Input: Component clusterings Π = {π1, π2, …, πm}, number of clusters k.

2. Initialization: S=.

3. while |S|<kn do

4  for i = 1 to m do

5.   for j = 1 to k do

6.    Randomly select a cell s from πijS

7.    S=S{s}

8.   end

9.   end

10.end

Given a selectively sampled sketch S, we apply steps 3 and 4 in Algorithm 1 to cells in S, using labels obtained from the full data in step 2. That is, we construct a co-association matrix whose entries count the number of times the two corresponding cells in S were placed in the same cluster by a run of the LSC algorithm in step 2. From this matrix, we compute a consensus clustering of S using hierarchical clustering and finally transfer cluster labels to the remaining cells using supervised k-nearest-neighbors classification. That is, we assign each cell not in S to the cluster that the majority of its k nearest neighbors were placed in by the preceding consensus clustering of S. Our selective sampling-based cluster ensemble approach is summarized in Algorithm 3.

1. Input: Cells x1, …, xn; number of clusters k.

2. Run the tailored LSC algorithm for a varying number of landmarks and different kernel bandwidths. Let Π = {π1, π2, …, πm} be the set of m clusterings.

3. Run selective sampling (Algorithm 2) on Π to obtain a sketch S of size |S|=[kn].

4. Summarize all clusterings of cells in S computed in step 2 in a co-association matrix HS.

5. Apply single linkage hierarchical clustering to HS to obtain k clusters for S.

6. Transfer labels to full data using k-nearest-neighbors classification.

Landmark-based spectral clustering performed in step 2 of Algorithm 3 takes O(n) (see above). Because we selectively sample a sketch of size |S|=O(n) in step 3, the complexity of steps 4 and 5 now reduces to O(n). Together with the k-NN classification that runs in O(n) in step 6, our selective sampling-based cluster ensemble scheme scales linearly with the number of cells n.

Do you have any questions about this protocol?

Post your question to gather feedback from the community. We will also invite the authors of this article to respond.

0/150

tip Tips for asking effective questions

+ Description

Write a detailed description. Include all information that will help others answer your question including experimental processes, conditions, and relevant images.

post Post a Question
0 Q&A