Hierarchical clustering and semantic categories induction

We applied a bottom-up agglomerative hierarchical clustering algorithm to the constructed semantic feature matrix, generating clusters based on the similarity of criteria sentences. Hierarchical clustering is tree-based, and its parameters are easy to choose. We used it to support human–computer interaction for identifying categories and labeling the ground truth. The algorithm starts by treating each criteria sentence as a separate cluster, then merges the two clusters that are closest according to the distance measure into one cluster. This step is repeated until only a single cluster remains. To better summarize the categories, two biomedical researchers reviewed the clustering results, merged similar clusters by judging the similarity of their criteria sentence expressions, and generalized the semantic categories.
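The bottom-up merging procedure described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`euclidean`, `average_linkage`, `agglomerate`) and the toy vectors are hypothetical, and the choice of Euclidean distance with average linkage is an assumption made for the sketch. An optional `distance_threshold` stops the merging early, at which point the remaining clusters are returned.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(vectors, distance_threshold=None):
    """Merge clusters bottom-up.

    Returns (clusters, history): clusters is a list of lists of row indices,
    history records each merge as (cluster_a, cluster_b, distance).
    """
    # Start with every sentence (row) in its own cluster.
    clusters = [[i] for i in range(len(vectors))]
    history = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average-linkage distance.
        (a, b), dist = min(
            (
                ((a, b), average_linkage(clusters[a], clusters[b], vectors))
                for a, b in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        # Optional early stop: do not merge at or above the threshold.
        if distance_threshold is not None and dist >= distance_threshold:
            break
        history.append((clusters[a], clusters[b], dist))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, history


# Hypothetical toy feature vectors standing in for criteria sentences.
toy_vectors = [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 1.1]]

full_tree, merges = agglomerate(toy_vectors)
cut_clusters, _ = agglomerate(toy_vectors, distance_threshold=0.65)
print(len(full_tree), len(merges), cut_clusters)
```

Run without a threshold, the loop continues until one cluster remains and `history` holds the full merge tree, which is the structure a reviewer would inspect when deciding which clusters to combine into a category.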

We implemented the algorithm using the Python library scikit-learn, version 0.24.0. The similarity measure between sentences was set to Euclidean distance, and the similarity measure between clusters was set to average linkage. The distance threshold is the linkage distance at or above which two clusters are no longer merged, so it bounds how dissimilar the criteria sentences within one cluster can be. A high distance threshold generates a few large clusters, while a low distance threshold generates many small clusters. We set the threshold to 0.65.
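A configuration along these lines might look as follows in scikit-learn. This is a sketch under stated assumptions, not the authors' code: the `features` array is a hypothetical stand-in for the semantic feature matrix, and the call relies on the estimator's default Euclidean distance rather than naming it explicitly, because the keyword that selects the distance is `affinity` in version 0.24 and was renamed `metric` in newer releases.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical stand-in for the semantic feature matrix:
# one row per criteria sentence, one column per feature.
features = np.array(
    [[0.0, 0.0], [0.0, 0.1], [1.0, 1.0], [1.0, 1.1], [3.0, 3.0]]
)

model = AgglomerativeClustering(
    n_clusters=None,          # must be None when distance_threshold is set
    linkage="average",        # average linkage between clusters
    distance_threshold=0.65,  # stop merging at or above this linkage distance
)  # Euclidean distance is the default and is left implicit here

labels = model.fit_predict(features)
print(model.n_clusters_, labels)
```

Raising `distance_threshold` in this sketch lets more merges happen and yields fewer, larger clusters; lowering it has the opposite effect.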



