Cluster validity: Introduction to clustering

November 21, 2006 by
Filed under: clustering, unsupervised learning, validity index 

In the near future, I will use this blog to write about recent research I’m involved in. I start today (and will continue over the following days) with an introduction to the topic I’m interested in: cluster validity.

Clustering is probably the best-known example of unsupervised learning. The goal of clustering is to group data points that are similar according to a given similarity measure (Euclidean distance is a common default). As Jain et al. write in (1), “clustering is a subjective process […] This subjectivity makes the process of clustering difficult”. Clustering techniques have been applied in various domains such as text mining, color image segmentation, sensor time series, information exploration and automatic counting in video sequences. In these domains, the number of clusters is usually not known in advance.
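To make the idea of "similar according to Euclidean distance" concrete, here is a minimal sketch. The toy points and the hand-picked centers are assumptions made purely for illustration, and the helper names `euclidean` and `assign_to_nearest` are hypothetical, not taken from any library.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_nearest(points, centers):
    """Return, for each point, the index of its closest center."""
    return [
        min(range(len(centers)), key=lambda j: euclidean(p, centers[j]))
        for p in points
    ]

# Hypothetical toy data: two visually obvious groups in 2-D.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.4),
          (5.0, 5.0), (5.2, 4.8), (4.9, 5.3)]

# Assumed group centers, chosen by hand for this illustration.
centers = [(0.0, 0.0), (5.0, 5.0)]

labels = assign_to_nearest(points, centers)
print(labels)  # the first three points share one label, the last three the other
```

Here the centers are given. A real clustering algorithm has to find them, and usually also needs to be told how many there are.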

One goal of cluster validity is to estimate the most reliable number of clusters in a dataset. Before going into more detail about cluster validity, the next post will focus on clustering techniques.
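As a rough sketch of what "estimating the number of clusters" can look like in practice: cluster the data for several candidate values of k, score each result with a validity index, and keep the best-scoring k. The example below assumes a basic k-means and the silhouette index. Both are choices made for this illustration, not necessarily the methods discussed in later posts. The data, the deterministic initialisation and all function names are hypothetical.

```python
import math

def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, iters=20):
    """Very small k-means; returns one cluster label per point."""
    # Assumption: deterministic farthest-point initialisation, used here
    # only so that the example is reproducible.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width; requires at least two non-empty clusters."""
    clusters = set(labels)
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical toy data: three well-separated groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 0), (10, 1), (11, 0), (11, 1),
          (5, 9), (5, 10), (6, 9), (6, 10)]

# Score each candidate number of clusters and keep the best one.
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3 for this toy dataset
```

On real data the picture is rarely this clean, and different validity indices can disagree about the best k.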

(1) A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.



2 Comments on Cluster validity: Introduction to clustering

  1. Will Dwinnell on Fri, 24th Nov 2006 9:18 pm
     I’m curious as to whether you’ve investigated k-harmonic means (KHM) clustering? Some authors claim that KHM produces “better” clusters, though I suspect that this translates to clusters which better satisfy those authors’ preferred measure of validity.

  2. Sandro Saitta on Mon, 27th Nov 2006 8:37 pm
     Thanks for the references, I will have a look at them as I don’t know KHM.
