One of the most common technique for clustering is K-means (1). I have already written a few words about clustering algorithms on this blog.

The main drawbacks of K-means are certainly the numeric consideration of the parameters, the unknown number of clusters K and the random starting centroid locations. The paper by Huang (2) is a possible solution to the first. For the second, cluster validation techniques can be used to find a reliable value for K.

In this post, I want to discuss about the third one. In K-means, the K initial centroids are chosen randomly. Therefore, running P times can result in P different clustering of the same data. A possible strategy for avoiding such a problem is multiple runs. The K-means procedure is done P times on the data and evaluated using a cluster validity index. Then, the run with the maximum (or minimum depending on the validity index) value is chosen as the clustering result.

(1) Jain, A. and Dubes, R.: 1988, Algorithms for Clustering Data, Prentice Hall.

(2) Huang, Z.: 1998, Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, Data Mining and Knowledge Discovery 2(3), 283.

Your proposal can be improved at no cost by performing a simlulated annealing that basically consists in performing several runs of the gradient descent (the K-MEANS algorithm) successively, in a manner that aims at highering the chances to find the global optimum of the cluster validation criterion.

See Refining Initial Points for K-Means Clustering for a more sophisticated k-means initialization procedure.

Also, I’ve seen (but can’t remember where, off-hand) a modified k-means in which the centroid calculation is performed using only a sub-sample of the data (the re-assignment of examples to clusters occurs over all examples every time, though).

I am curious as to why you chose k-means for your task at hand. If it is for speed, then the idea of simulated annealing or running the procedure m times will surely negate that benefit.

I guess it depends on your data, but have you looked into model based clustering/mixture based – such as MCLUST and MULTIMIX?

First, thanks to Damien and Will for their suggestions and links.

Concerning K-means, it was a starting choice for me. I was studying cluster validity and was using K-means as it is the most simple example of clustering (to my opinion). And since I usually follow Occam’s Razor principle, I started this way. Even if it is working fine for my data (1000 points in 20 dimensions) at least in the speed point of view, your propositions of course may give better results. Therefore, for me, using Gaussian mixture or fuzzy clustering (for example) can be tasks for future work.