Cluster validity: Introduction to clustering
In the near future, I will use this blog to write about recent research I’m involved in. I start today, and will continue over the following days, with an introduction to the topic I’m interested in: cluster validity.
Clustering is probably the best-known example of unsupervised learning. The goal of clustering is to group data points that are similar according to a given distance or similarity measure (Euclidean distance is a common default). As Jain et al. write in (1), “clustering is a subjective process […] This subjectivity makes the process of clustering difficult”. Clustering techniques have been applied in various domains such as text mining, color image segmentation, sensory time series, information exploration and automatic counting in video sequences. In these domains, the number of clusters is usually not known in advance.
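To make the idea of grouping by Euclidean distance concrete, here is a minimal Python sketch. It is only an illustration, not a full clustering algorithm: the toy points, the two hand-picked centers and the function names `euclidean` and `assign_to_nearest` are all assumptions made for this example.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def assign_to_nearest(points, centers):
    """Label each point with the index of its closest center."""
    labels = []
    for p in points:
        distances = [euclidean(p, c) for c in centers]
        labels.append(distances.index(min(distances)))
    return labels

# Toy 2-D data (hypothetical): two points near the origin, two near (5, 5).
points = [(0.0, 0.0), (0.5, 0.2), (5.0, 5.0), (5.2, 4.8)]
# Centers chosen by hand for illustration; a real algorithm would estimate them.
centers = [(0.0, 0.0), (5.0, 5.0)]

labels = assign_to_nearest(points, centers)
print(labels)
```

Note that the number of groups here is fixed by the number of centers supplied. With real data, that number is exactly what we usually do not know.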
One goal of cluster validity is to estimate the most reliable number of clusters in a dataset. Before going into more detail about cluster validity, the next post will focus on clustering techniques.
(1) A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264-323, 1999.