Given that spatial domain or cell type identification is the primary objective of clustering methods, we conducted a thorough performance comparison using ARI whenever manual annotation was available to serve as ground truth. All statistical methods and some deep learning-based methods fix the random seed to produce deterministic output, whereas other deep learning-based methods do not fix the seed in practice. To account for this variance in performance, we computed the average ARI from 20 runs on each dataset and displayed the results as box plots and a heatmap to aid comparison and visualization. Additionally, because the 33 ST slices span eight different datasets, it is difficult to rank overall performance from the average-ARI heatmap alone, so we also provided a second heatmap for the overall ranking. This ranking heatmap was generated by normalizing all results within the same slice, dividing each by the maximum ARI (the best performance) among all methods, so that the best-performing method on each slice receives a score of 1. With 33 data slices in total, the best possible summed score for a method is therefore 33, and the best possible average score is 1. To ensure fairness, rank scores were averaged exclusively over feasible ST data, excluding instances with NaN values. We performed the same analysis based on the NMI, AMI, and HOM metrics.
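The slice-wise normalization and ranking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `rank_methods` is a hypothetical helper, and NumPy's NaN-aware reductions are used here to skip infeasible (NaN) slices when averaging.

```python
import numpy as np

def rank_methods(ari, methods):
    """Rank clustering methods from per-slice average ARIs.

    ari: (n_slices, n_methods) array of average ARIs over repeated runs;
         NaN marks a method that was infeasible on that slice.
    Returns {method: (summed_score, average_score)}.
    """
    ari = np.asarray(ari, dtype=float)
    # Normalize each slice by its best (maximum) ARI, so the top method
    # on every slice scores exactly 1.
    slice_max = np.nanmax(ari, axis=1, keepdims=True)
    norm = ari / slice_max
    # Summed score: best possible equals the number of feasible slices
    # (33 when a method runs on all slices); average score: best possible is 1.
    # NaN entries (infeasible data) are excluded from the average.
    summed = np.nansum(norm, axis=0)
    average = np.nanmean(norm, axis=0)
    return dict(zip(methods, zip(summed, average)))
```

A toy example: with two slices and two methods, a method that achieves the slice maximum on both slices gets an average score of 1 and a summed score of 2.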