In this subsection, we explain the research methods adopted in the proposed approach. The approach aims to develop an optimized clustering method for analyzing the distribution of pixels in histological and near-infrared (NIR) prostate cancer images. It builds on the traditional k-means clustering algorithm and the elbow criterion technique, described in the following subsections.
Clustering methods are algorithms that group objects or regions into clusters based on their attributes. Each cluster can be defined as a group of instances that are similar to one another and dissimilar to the instances in other clusters. In machine learning, clustering-based methods belong to unsupervised learning, which trains on the features of available unlabeled data. In other words, clustering is a static classification of similar entities or objects into several groups or subsets, such that member objects of the same group have similar attributes and typically lie close together in the coordinate system. In prostate cancer image segmentation and analysis, similar pixels are grouped together into clusters or segments according to a similarity condition that is well defined between pixels [17].

K-means is one of the most common and effective unsupervised algorithms used for clustering tasks. It produces good clustering results and is simple to implement, and several variants of it exist. The basic notion behind the k-means algorithm is as follows. Suppose that a given dataset of data points (x1, x2, x3, …, xn) is divided into K clusters (C1, C2, C3, …, CK) according to the distances between the data points. The goal is then to minimize the sum of squared errors within clusters (SSEWC), which can be computed as follows:

\mathrm{SSEWC} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 = \sum_{i=1}^{K} |C_i| \operatorname{Var}(C_i) \quad (1)
where |Ci| is the size of cluster i, Var(Ci) is the variance of cluster i, and μi is the mean of the data points in Ci. This is equivalent to minimizing the pairwise squared deviations of data points within the same cluster:

\mathrm{SSEWC} = \sum_{i=1}^{K} \frac{1}{2|C_i|} \sum_{x, y \in C_i} \lVert x - y \rVert^2
The equivalence can be deduced from the identity relating the deviations of the data points from their mean to their pairwise differences:

|C_i| \sum_{x \in C_i} \lVert x - \mu_i \rVert^2 = \frac{1}{2} \sum_{x, y \in C_i} \lVert x - y \rVert^2
By this equivalence, and because the total variance is constant, minimizing the sum of squared errors within clusters (SSEWC) is equal to maximizing the sum of squared errors between clusters (SSEBC), i.e., between the data points of different clusters, computed as follows:

\mathrm{SSEBC} = \sum_{i=1}^{K} |C_i| \, \lVert \mu_i - \mu \rVert^2
where μ is the overall mean of all data points, μi is the mean of the data points in Ci, and |Ci| is the size of cluster i.
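Because the total sum of squares is constant, the decomposition into within-cluster and between-cluster terms can be checked numerically on a small synthetic example (toy values chosen for illustration, not drawn from the paper's images):

```python
import numpy as np

# Toy 1-D dataset with two obvious groups (illustrative values only).
data = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
clusters = [data[:3], data[3:]]

mu = data.mean()                       # overall mean of all data points
sse_total = np.sum((data - mu) ** 2)   # total sum of squares (constant)

# Within-cluster sum of squared errors, Equation (1).
sse_wc = sum(np.sum((c - c.mean()) ** 2) for c in clusters)

# Between-cluster sum of squared errors: |Ci| * ||mu_i - mu||^2.
sse_bc = sum(len(c) * (c.mean() - mu) ** 2 for c in clusters)

print(sse_wc, sse_bc, sse_total)  # sse_wc + sse_bc == sse_total
```

Splitting the data at its natural break keeps sse_wc small and sse_bc large, which is exactly the trade-off k-means optimizes.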
Hence, the k-means algorithm aims to make the distance between clusters as large as possible while making the distance between data points of the same cluster as small as possible.
The main steps of the k-means clustering algorithm can be summarized as follows:
1. K centroids are created; then, each point is assigned to the nearest centroid.
2. Each centroid is recalculated as the mean of its assigned points. This process is repeated until the cluster allocation of the data points no longer changes.
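These two steps can be sketched in Python with NumPy (a minimal illustration under assumed names, not the implementation used in this work; empty-cluster handling is omitted for brevity):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means: assign to nearest centroid, recompute, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: create K centroids by picking K distinct data points at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(points[:, None] - centroids[None], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the cluster allocation no longer changes.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: recalculate each centroid as the mean of its points.
        centroids = np.array([points[labels == i].mean(axis=0)
                              for i in range(k)])
    return labels, centroids

# Two well-separated toy blobs: k-means recovers them with K = 2.
pts = np.array([[0., 0.], [0., 1.], [1., 0.],
                [10., 10.], [10., 11.], [11., 10.]])
labels, cents = kmeans(pts, k=2)
```

On the toy blobs above, the assignment step and the centroid update alternate until the allocation stabilizes after a few iterations.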
The main advantages of k-means are as follows: first, it is an unsupervised method, so no training set is required; second, its principle is very simple, and it is easy to implement; and finally, its results are easily interpretable. In contrast, one disadvantage of k-means is the difficulty of choosing the value of K: an inappropriate value might lead to poor clustering results. This is why it is necessary to perform feature checks to determine the number of clusters in the dataset, which depends on the application domain. Other disadvantages of the k-means algorithm are that it might converge to a local minimum, that it converges slowly on large-scale datasets, and that it is sensitive to outliers.
The proposed method aims to segment, or cluster, and analyze the pixels of histological and near-infrared (NIR) prostate cancer images by improving and optimizing the traditional k-means algorithm, finding the number of clusters through the elbow criterion technique. The elbow criterion is a heuristic applied to determine the number of clusters of the data points in a dataset. It is used here to obtain the optimal number of clusters because it is an empirical method that is simple and easy to implement. While applying the k-means clustering algorithm, the elbow technique plots the explained variation against the number of clusters and picks the elbow of the curve as the number of clusters to use. It relies on computing the within-cluster sum of squared errors (SSEWC) given in Equation (1) over all data points, which reflects both the compactness of the data points within each cluster and the separation between clusters. The core idea of the elbow technique is explained as follows:
As the number of clusters K increases, the model separates the data more finely and each cluster becomes more compact; therefore, the SSEWC gradually becomes smaller.
When K is smaller than the true number of clusters, increasing the value of K significantly improves the compactness of each cluster, so the SSEWC decreases sharply. Once K reaches the true number of clusters, further increases in K yield only marginal gains in compactness; the return diminishes quickly, and the decrease in SSEWC flattens out as the K value keeps increasing. The relationship between SSEWC and K can therefore be represented by an elbow shape, and the K value at the elbow is the true number of clusters for the given data points. For instance, in Figure 3, it is obvious that the sharpest bend of the elbow occurs at a K value of 4. Hence, the best number of clusters is 4.
Figure 3. An example of how the elbow criterion technique works.
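The behavior described above can be reproduced with a short sketch (our own illustrative code on synthetic 2-D data with four true clusters, not the paper's image data; a deterministic farthest-first seeding is used instead of random initialization so the runs are repeatable):

```python
import numpy as np

def farthest_first(points, k):
    """Deterministic seeding: greedily pick k spread-out data points."""
    seeds = [points[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(points[:, None] - np.array(seeds)[None],
                           axis=2).min(axis=1)
        seeds.append(points[d.argmax()])
    return np.array(seeds)

def sse_wc(points, k, n_iter=50):
    """SSEWC of Equation (1) after a basic k-means run."""
    centroids = farthest_first(points, k)
    for _ in range(n_iter):
        labels = np.linalg.norm(points[:, None] - centroids[None],
                                axis=2).argmin(axis=1)
        # Keep the old centroid if a cluster happens to empty out.
        centroids = np.array([points[labels == i].mean(axis=0)
                              if np.any(labels == i) else centroids[i]
                              for i in range(k)])
    return sum(np.sum((points[labels == i] - centroids[i]) ** 2)
               for i in range(k))

# Synthetic data: four well-separated clusters of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in [(0, 0), (5, 0), (0, 5), (5, 5)]])

# SSEWC drops sharply up to K = 4, then flattens: the elbow is at 4.
sse = {k: sse_wc(X, k) for k in range(1, 9)}
for k, v in sse.items():
    print(k, round(v, 1))
```

Printing SSEWC for K = 1, …, 8 shows large drops up to K = 4 and only marginal decreases afterwards, which is the elbow the technique looks for.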