Cluster validity: Existing indices
The third, and final, post on cluster validity is about existing validity indices. As written in (1), the two fundamental issues in cluster validity are 1) the number of clusters present in the data and 2) how good the clustering itself is.
Several indices have been proposed in the literature. The main idea is to compute an index for each candidate number of clusters, plot its value against that number, and then analyze the plot. The Dunn index (2) combines the dissimilarity between clusters with their diameters to estimate the most reliable number of clusters. It is computationally expensive and sensitive to noise. The Silhouette index (3) uses the average dissimilarity between points to show the structure of the data and, consequently, its possible clusters. It is only suitable for estimating the first choice or best partition. The Davies-Bouldin index (4) is computed from the dispersion of each cluster and the dissimilarity between clusters. According to (5), the Davies-Bouldin index is among the best indices.
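To make the idea of comparing an index across numbers of clusters concrete, here is a minimal sketch of the Dunn and Davies-Bouldin indices in plain Python (3.8+ for `math.dist`). The function names, the toy data and the hand-made partitions are my own assumptions for illustration. I also assume Euclidean distance, the closest pair of points as the dissimilarity between clusters, the farthest pair as the diameter, and the mean distance to the centroid as the dispersion. Other definitions exist.

```python
import math

def centroid(cluster):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def dunn_index(clusters):
    """Smallest between-cluster distance over largest diameter (higher is better)."""
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")
    min_separation = min(
        math.dist(p, q)
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
        for p in a
        for q in b
    )
    # All pairs are compared, which is why the index gets expensive on large data.
    max_diameter = max(math.dist(p, q) for c in clusters for p in c for q in c)
    return min_separation / max_diameter

def davies_bouldin_index(clusters):
    """Mean worst-case ratio of dispersion to centroid distance (lower is better)."""
    if len(clusters) < 2:
        raise ValueError("the Davies-Bouldin index needs at least two clusters")
    centers = [centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k

# Toy data (made up): three well-separated square blobs.
A = [(0, 0), (0, 1), (1, 0), (1, 1)]
B = [(5, 5), (5, 6), (6, 5), (6, 6)]
C = [(10, 0), (10, 1), (11, 0), (11, 1)]

# Hand-made partitions for k = 2, 3, 4 (a clustering algorithm would normally produce these).
partitions = {
    2: [A + B, C],           # two blobs merged
    3: [A, B, C],            # the intended structure
    4: [A, B, C[:2], C[2:]]  # one blob split in two
}

dunn = {k: dunn_index(p) for k, p in partitions.items()}
db = {k: davies_bouldin_index(p) for k, p in partitions.items()}

best_k_dunn = max(dunn, key=dunn.get)  # Dunn: look for the maximum
best_k_db = min(db, key=db.get)        # Davies-Bouldin: look for the minimum

for k in sorted(partitions):
    print(k, round(dunn[k], 3), round(db[k], 3))
```

On this toy data both indices single out three clusters. On real data the curve is rarely that clean, which is why the plot is analyzed rather than read off mechanically.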
The Silhouette, Dunn and Davies-Bouldin indices require the definition of at least two clusters. Finally, note that several other indices exist in the literature. Some are computationally expensive, while others are unable to discover the real number of clusters in certain datasets (5).
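The two-cluster requirement is easy to see in code. Below is a rough sketch of the Silhouette index in plain Python (3.8+), assuming Euclidean distance and the usual convention that a point alone in its cluster gets a value of 0. The function names and the toy data are hypothetical, chosen only for illustration.

```python
import math

def silhouette_values(clusters):
    """Silhouette value s = (b - a) / max(a, b) for every point."""
    if len(clusters) < 2:
        # With one cluster there is no "nearest other cluster", so b is undefined.
        raise ValueError("the silhouette needs at least two clusters")
    values = []
    for i, own in enumerate(clusters):
        for p in own:
            if len(own) == 1:
                values.append(0.0)  # convention for singleton clusters
                continue
            # a: mean distance to the other points of the same cluster
            a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
            # b: mean distance to the points of the nearest other cluster
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters) if j != i
            )
            values.append((b - a) / max(a, b))
    return values

def mean_silhouette(clusters):
    """Average silhouette value of a partition (higher is better)."""
    values = silhouette_values(clusters)
    return sum(values) / len(values)

# Toy data (made up): three well-separated square blobs.
A = [(0, 0), (0, 1), (1, 0), (1, 1)]
B = [(5, 5), (5, 6), (6, 5), (6, 6)]
C = [(10, 0), (10, 1), (11, 0), (11, 1)]

scores = {
    2: mean_silhouette([A + B, C]),
    3: mean_silhouette([A, B, C]),
    4: mean_silhouette([A, B, C[:2], C[2:]]),
}
best_k = max(scores, key=scores.get)
print(best_k, {k: round(v, 3) for k, v in scores.items()})
```

Calling `mean_silhouette([A + B + C])` raises a `ValueError`, so none of these indices can tell you that the data form a single cluster.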
(1) U. Maulik and S. Bandyopadhyay. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell., 24(12):1650-1654, 2002.
(2) J.C. Dunn. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4:95-104, 1974.
(3) L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.
(4) D.L. Davies and D.W. Bouldin. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell., 1(2):224-227, 1979.
(5) M. Kim and R.S. Ramakrishna. New indices for cluster validity assessment. Pattern Recogn. Lett., 26(15):2353-2363, 2005.