Cluster validity: Clustering algorithms

November 22, 2006 by
Filed under: clustering algorithms, k-means 

Now that the clustering ideas have been introduced, let’s look at existing clustering strategies. Several clustering techniques can be found in the literature. They can be divided in four main categories (1): partitional clustering (K-means, etc.), hierarchical clustering (BIRCH, etc.), density-based clustering (DBSCAN, etc.) and grid-based clustering (STING, etc.). In the literature, clustering can be found under different expression such as unsupervised learning, numerical taxonomy and partition (2).

One of the most common technique for clustering is K-means (3). Main reasons can be found among other categories drawbacks (even if k-means has its own drawbacks). Hierarchical clustering, for example, usually has a higher complexity such as O(n^2). Density-based clustering algorithms often have non-intuitive parameters to tune. Finally, grid-based clustering algorithms not always give clusters of good quality (1).

Main advantages of K-means are its computational efficiency and its simplicity to understand the results. Bolshakova and Azuaje (4) thinks that K-means is the most widely used clustering algorithm in practice. This last point is a good indicator of its efficiency in real-life situations. The main drawbacks of K-means are certainly the random centroid locations and unknown number of clusters K. This number has to be known in advance and is an input in the standard K-means algorithm. That’s where cluster validity enters in the game. And this is for the next post.

(1) M. Halkidi, Y. Batistakis, and M. Vazirgiannis. On clustering validation techniques. J. of Intelligent Information Systems, 17(2-3):107-145, 2001.
(2) S. Theodoridis and K. Koutroumbas. Pattern Recognition. Academic Press, 1999.
(3) A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
(4) N. Bolshakova and F. Azuaje. Cluster validation techniques for genome expression data. Signal Process., 83(4):825-833, 2003.



2 Comments on Cluster validity: Clustering algorithms

  1. Nimit on Wed, 22nd Nov 2006 7:06 pm
  2. you may like to add this reference:
    James C. Bezdek and Nikhil R. Pal. Some New Indexes of Cluster Validity. SMCB, 28(3):301–315, 1998.

  3. Sandro Saitta on Thu, 23rd Nov 2006 9:58 am
  4. Thanks! This reference is an important contribution to cluster validity. I have many other references if needed.

Tell me what you're thinking...

  • Swiss Association for Analytics

  • Most Popular Posts

  • T-shirts, Mugs & Mousepads

    All benefits given to a charity association
  • Data Mining Search Engine

    Supported by AnalyticBridge

  • Archives

  • Reading Recommandations