Cluster validity: Clustering algorithms
Now that the clustering ideas have been introduced, let’s look at existing clustering strategies. Several clustering techniques can be found in the literature. They can be divided into four main categories (1): partitional clustering (e.g., K-means), hierarchical clustering (e.g., BIRCH), density-based clustering (e.g., DBSCAN) and grid-based clustering (e.g., STING). In the literature, clustering also appears under other names such as unsupervised learning, numerical taxonomy and partitioning (2).
One of the most common techniques for clustering is K-means (3). The main reasons lie in the drawbacks of the other categories (even if K-means has drawbacks of its own). Hierarchical clustering, for example, usually has a higher complexity, typically O(n^2). Density-based clustering algorithms often have non-intuitive parameters to tune. Finally, grid-based clustering algorithms do not always give clusters of good quality (1).
The main advantages of K-means are its computational efficiency and the ease of interpreting its results. Bolshakova and Azuaje (4) consider K-means to be the most widely used clustering algorithm in practice, which is a good indicator of its usefulness in real-life situations. The main drawbacks of K-means are certainly the random initial centroid locations and the unknown number of clusters K: this number has to be known in advance and is an input to the standard K-means algorithm. That’s where cluster validity enters the game. And this is for the next post.
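To make the two drawbacks concrete, here is a minimal K-means sketch in plain Python (not the exact algorithm from any of the cited papers; data, function name and parameters are illustrative). Note that K is a required input, and the initial centroids are drawn at random, so different seeds can yield different clusterings:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means sketch: K must be chosen up front, and the
    result depends on the random initial centroid locations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iters):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[i])))
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = tuple(sum(xs) / len(c) for xs in zip(*c))
    return centroids, clusters

# two well-separated groups of three points each
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)
```

On data this well separated the algorithm recovers the two groups, but with overlapping data a poor initialization can get stuck in a bad local optimum, and nothing in the algorithm itself tells us whether k=2 was the right choice: that is exactly the question cluster validity addresses.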
(1) M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. J. of Intelligent Information Systems, 17(2-3):107-145, 2001.
(2) S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1999.
(3) A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
(4) N. Bolshakova and F. Azuaje. Cluster validation techniques for genome expression data. Signal Process., 83(4):825-833, 2003.