K-means: random starting centroids
One of the most common techniques for clustering is K-means (1). I have already written a few words about clustering algorithms on this blog.
The main drawbacks of K-means are that it handles only numeric attributes, that the number of clusters K must be known in advance, and that the starting centroid locations are random. The paper by Huang (2) offers a possible solution to the first problem. For the second, cluster validation techniques can be used to find a reliable value for K.
In this post, I want to discuss the third one. In K-means, the K initial centroids are chosen randomly, so running the algorithm P times on the same data can produce P different clusterings. A possible strategy for avoiding this problem is multiple runs: the K-means procedure is executed P times on the data and each run is evaluated with a cluster validity index. The run with the maximum (or minimum, depending on the validity index) value is then chosen as the final clustering result.
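The multiple-runs strategy can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it uses the within-cluster sum of squares (often called inertia) as the validity index to minimize, and the function names `kmeans` and `kmeans_multi` are just illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """One K-means run with K random starting centroids."""
    rng = np.random.default_rng(rng)
    # Choose k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # Validity index: within-cluster sum of squares (inertia).
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

def kmeans_multi(X, k, p=10, seed=0):
    """Run K-means p times and keep the run with minimum inertia."""
    rng = np.random.default_rng(seed)
    runs = [kmeans(X, k, rng=rng) for _ in range(p)]
    return min(runs, key=lambda run: run[2])
```

Because each call to `kmeans` starts from different random centroids, individual runs may converge to different local minima; `kmeans_multi` simply keeps the run whose inertia is lowest, which is exactly the selection step described above.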
(1) Jain, A. and Dubes, R.: 1988, Algorithms for Clustering Data, Prentice Hall.
(2) Huang, Z.: 1998, Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, Data Mining and Knowledge Discovery 2(3), 283.