Kmeans: random starting centroids
One of the most common techniques for clustering is Kmeans (1). I have already written a few words about clustering algorithms on this blog.
The main drawbacks of Kmeans are certainly its restriction to numeric attributes, the unknown number of clusters K, and the random starting centroid locations. The paper by Huang (2) offers a possible solution to the first. For the second, cluster validation techniques can be used to find a reliable value for K.
In this post, I want to discuss the third one. In Kmeans, the K initial centroids are chosen randomly. Therefore, running the algorithm P times can result in P different clusterings of the same data. A possible strategy for avoiding this problem is multiple runs: the Kmeans procedure is applied P times to the data, and each run is evaluated with a cluster validity index. The run with the maximum (or minimum, depending on the validity index) value is then chosen as the clustering result.
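As a minimal sketch of this multiple-runs strategy, the snippet below runs Kmeans P times from random starting centroids and keeps the run with the best validity value. Here I use the within-cluster sum of squares as the validity index (to be minimized); the function names and defaults are my own choices, not a reference implementation.

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """One Kmeans run with random initial centroids drawn from the data."""
    rng = np.random.default_rng(rng)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid (squared Euclidean distance)
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        # recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # within-cluster sum of squares, used here as the validity index
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss

def kmeans_multi(X, k, P=10, seed=0):
    """Run Kmeans P times and keep the run with the minimum validity index."""
    runs = [kmeans(X, k, rng=seed + p) for p in range(P)]
    return min(runs, key=lambda run: run[2])
```

Any other validity index could be swapped in; only the direction of the `min` (or `max`) in `kmeans_multi` changes.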
(1) Jain, A. and Dubes, R.: 1988, Algorithms for Clustering Data, Prentice Hall.
(2) Huang, Z.: 1998, Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, Data Mining and Knowledge Discovery 2(3), 283-304.
Comments
4 Comments on Kmeans: random starting centroids

damien francois on
Tue, 6th Mar 2007 8:10 pm

Your proposal can be improved at no cost by performing simulated annealing, which basically consists in performing several runs of gradient descent (the Kmeans algorithm) successively, in a manner that aims at increasing the chances of finding the global optimum of the cluster validation criterion.

Will Dwinnell on
Wed, 7th Mar 2007 2:48 am

See Refining Initial Points for KMeans Clustering for a more sophisticated Kmeans initialization procedure.
Also, I've seen (but can't remember where, offhand) a modified Kmeans in which the centroid calculation is performed using only a subsample of the data (the reassignment of examples to clusters occurs over all examples every time, though).

John Aitchison on
Sat, 10th Mar 2007 1:50 am

I am curious as to why you chose Kmeans for your task at hand. If it is for speed, then the idea of simulated annealing or running the procedure m times will surely negate that benefit.
I guess it depends on your data, but have you looked into model-based/mixture-based clustering, such as MCLUST and MULTIMIX?

Sandro Saitta on
Wed, 14th Mar 2007 1:11 pm

First, thanks to Damien and Will for their suggestions and links.
Concerning Kmeans, it was a starting choice for me. I was studying cluster validity and used Kmeans since it is, in my opinion, the simplest example of clustering. And since I usually follow Occam's Razor, I started this way. It works fine for my data (1000 points in 20 dimensions), at least from the speed point of view, but your propositions may of course give better results. Using Gaussian mixtures or fuzzy clustering, for example, are therefore tasks for future work.