Cluster validity: Existing indices
The third and final post on cluster validity is about existing validity indices. As written in (1), the two fundamental issues in cluster validity are 1) the number of clusters present in the data and 2) how good the clustering itself is.
Several indices have been proposed in the literature. The main idea is to compute an index for a range of cluster counts, plot it against the number of clusters, and then analyze this plot. The Dunn index (2) combines the dissimilarity between clusters with their diameters to estimate the most reliable number of clusters. It is computationally expensive and sensitive to noise. The Silhouette index (3) uses the average dissimilarity between points to show the structure of the data and, consequently, its possible clusters. It is only suitable for estimating the first choice or best partition. The Davies-Bouldin index (4) is computed from the dispersion of each cluster and the dissimilarity between clusters. According to (5), the Davies-Bouldin index is among the best indices.
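To make the Dunn and Davies-Bouldin definitions more concrete, here is a minimal pure-Python sketch. It assumes Euclidean distance, single-linkage separation for Dunn, and centroid-based scatter for Davies-Bouldin, which is one common formulation and not necessarily the exact variant used in (2), (4) or (5). The function names and the toy data are made up for illustration. A higher Dunn value and a lower Davies-Bouldin value both point to a better partition.

```python
import math
from itertools import combinations


def group_by_label(points, labels):
    """Collect points into one list per cluster label."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("both indices need at least two clusters")
    return list(clusters.values())


def centroid(cluster):
    """Coordinate-wise mean of a cluster."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def dunn_index(points, labels):
    """Smallest between-cluster distance divided by largest cluster diameter."""
    clusters = group_by_label(points, labels)
    separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    diameter = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    # All-singleton partitions have zero diameter; treat them as unbounded.
    return separation / diameter if diameter > 0 else math.inf


def davies_bouldin_index(points, labels):
    """Mean, over clusters, of the worst (scatter_i + scatter_j) / distance_ij."""
    clusters = group_by_label(points, labels)
    centers = [centroid(c) for c in clusters]
    # Scatter = average distance of a cluster's points to its centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centers)
    ]
    k = len(clusters)
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k


# Toy data (assumption): three tight, well-separated groups in 2D.
points = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0), (0, 10), (0, 11), (1, 10)]
three_clusters = [0, 0, 0, 1, 1, 1, 2, 2, 2]
two_clusters = [0, 0, 0, 1, 1, 1, 0, 0, 0]  # merges the first and third group

for name, labels in (("k=2", two_clusters), ("k=3", three_clusters)):
    print(name, dunn_index(points, labels), davies_bouldin_index(points, labels))
```

The nested loops over all point pairs in `dunn_index` also show why that index gets expensive on large datasets: the cost grows quadratically with the number of points.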
The Silhouette, Dunn and Davies-Bouldin indices all require at least two clusters to be defined. Finally, I want to point out that several other indices exist in the literature. Some are computationally expensive, while others are unable to discover the real number of clusters in certain datasets (5).
(1) U. Maulik and S. Bandyopadhyay. Performance evaluation of some clustering algorithms and validity indices. IEEE Trans. Pattern Anal. Mach. Intell., 24(12):1650-1654, 2002.
(2) J.C. Dunn. Well separated clusters and optimal fuzzy partitions. Journal of Cybernetics, 4:95-104, 1974.
(3) L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
(4) D.L. Davies and D.W. Bouldin. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell., 1(2):224-227, 1979.
(5) M. Kim and R.S. Ramakrishna. New indices for cluster validity assessment. Pattern Recogn. Lett., 26(15):2353-2363, 2005.
Comments
10 Comments on Cluster validity: Existing indices

jackie on
Thu, 14th Dec 2006 10:00 am
hello, i heard that there is a clustering validity index called the overall mean inner cluster similarity and the overall mean inter cluster similarity, but i don’t know how to compute it. do you know it? can you tell me?

Sandro Saitta on
Fri, 15th Dec 2006 10:38 pm
Hi Jackie,
I would be glad to help you if you give me some more details about the validity index you’re talking about.
Do you have an author name, a date?

jackie on
Thu, 21st Dec 2006 9:21 am
oh, yes, i know it’s from the book: Finding groups in data: an introduction to cluster analysis, but i can’t find the source. so i would appreciate it if you can help me. it’s urgent, thanks!

Sandro Saitta on
Thu, 21st Dec 2006 9:44 am
The book you mention is by Kaufman and Rousseeuw. They developed the Silhouette width. I don’t know if this is the validity index you’re talking about. I will see if I can get the book to have a look inside it.

jackie on
Thu, 21st Dec 2006 10:19 am
thank you for your help. i just want to know how to compute the index of “the overall mean inner cluster similarity” and “the overall mean inter cluster similarity”. by the way, i’m doing some work on clustering web logs right now. do you know any validity index other than recommendation accuracy? i don’t want to do recommendation right now. thanks again!

Sandro Saitta on
Thu, 21st Dec 2006 12:43 pm
Sorry, I cannot find any information on “overall mean inner cluster similarity”. I have looked in the book you mentioned, but was not able to find anything. I have also looked on Google, and the search for “overall mean inner cluster similarity” gives only two hits (one of them is this blog). That is quite low for a validity index, so I think it must have a different name in the literature.
My advice is to use another validity index such as Silhouette or Davies-Bouldin. The Dunn index is not computationally efficient. Many other validity indices exist. I think Silhouette is a good first try. For a very short introduction (enough to implement it), you can read this paper: “Cluster Validation Techniques for Genome Expression Data”. You can find it using Google.

jackie on
Mon, 25th Dec 2006 1:00 pm
hi, merry christmas! thank you for your advice. i’m trying... i’ll let you know as soon as i have the result.

Anonymous on
Tue, 20th Jan 2009 7:53 pm
hi there, does anyone know the SDbw index proposed by Halkidi et al.? I’ve implemented the SDbw index, but my implementation does not reach its minimum at the optimal number of clusters. Any code or a link to source code for this index would be appreciated. Thx.

Sandro Saitta on
Wed, 21st Jan 2009 2:31 pm
Hello,
I’m not sure, but I think you can find a Matlab version of this code on Mathworks. You should look for code using “clustering”, “cluster index”, “cluster indices” as search terms.
Hope it helps.

sneha on
Thu, 3rd Feb 2011 7:15 am
how is the spread of the data computed if i know the covariance matrix in the multivariate case?