Although often used in practice, K-means has several drawbacks. The number of clusters has to be defined in advance and the algorithm is dependent upon the starting centroid locations. More details on how to handle these issues can be found on Data Mining Research (search for clustering in the upper bar).

A weakness, which is common to clustering in general, concerns the visualization of the obtained clusters. A possible solution is to preprocess the data using PCA (1). First, the PCA procedure is applied to the data. Using the principal components the data is mapped into the new feature space. Then, the k-means algorithm is applied to the data in the feature space. The final objective is to be better able to distinguish the different clusters. The following picture shows the difference between plotting the data with two random parameters and the two first principal components.

(1) I.T. Jolliffe. Principal Component Analysis. Springer, 2002.

Hi Sandro,

I am not sure that I agree with this use of PCA and k-means.

If you already have a cluster solution, you can visualize it in 2 dimensional space using, for instance, multidimensional scaling.

Or you could plot the data points (colorized according to groups) on the first 2 linear discriminant functions.

If you first transform your data to just the first two principal components (before clustering), you are emphasizing the variables with the greatest variance ..however, these might not be the variables that contribute most to “cluster separation”.

Hi John,

Thanks for the comment. I realized that my post was not clear enough. When using PCA, I meant I use

allthe PCs when doing clustering. Then, I onlyplotthe two first PCs. Hope this is clearer.hi, sandro..

I’m th fourth year in undergradute, and i have final project about clustering speech signal for identification their phoneme using K-Means.

I still confuse about how to give label in the clusters as result of clustering. Could you help me to give any solutions?

thanks for your attention..

Sandro, I still don’t think I agree with your approach. When using ALL the PC’s you are effectively using all the (linear) information in the original variable set .. the usual reason for PCA is to reduce the dimensionality of the data by dropping some of the (later) PCs.

Then there is the question of variances .. if you use the eigenvalues ie do not rescale the components to a constant variance, then the first PC will dominate the k means solution. If you do rescale them all to common unit varaiance, then this means that factors (PCs) that have less variance in the original data will now have an equal effect as all other factors.. this may not be what you want.

Finally, I suspect that using PCA will smooth out any “interestingness” in your original data. By the Central Limit Theorem a weighted combination of non Gaussian variables will tend to be more Gaussian .. that is not “interesting”. Suppose some of your original variables were bimodal .. that is “interesting” and will influence a k-means solution. Adding these bimodal variables to form a “factor” (PC) will disappear that “interestingness”.

nie,

I don’t quite know what you mean when you say “labels” .. you can just call them cluster 1, cluster 2 etc.. But if you examine the means on each variable you will find differences across clusters.. so cluster 1 might have higher means than all other clusters on variables 17 and 23. This might then allow you to give a “meaningful name” to this cluster eg “near fricatives”

Hi John,

Sorry for the late answer. I’ve had other stuff to do, but I’m still on the issue of combining PCA and K-means. I have made some test which in my case shows that using PCA or no PCA before K-means doesn’t change the clustering solution. However, it may be specific to my data set and I still have to think about it. Thanks for your help.

John and Sandro –

The crux of PCA is the creation of orthogonal dimensions, which is why it creates a more palatable visual solution. No inaccuracy is created by clustering in either the physical space or the PCA space. K-means also does not ensure global variance minimums so the K-means solution applied directly may not be the most “efficient” solution in that sense.

In fact, in the literature is an article suggesting PCA as the continuous solution to k-means clustering; in that literature it suggests both using PCA as a basis for clustering OR using a method that approximates connectivity. Further, it is also established that the subspace spanned by the PCs is identical to the subspace defined by the cluster centroids.

In truth, PCA (given its close relationship to SVD) can have significant advantages of LDA where you have to make particular distributional assumptions that are quite strict. it’s just a matrix decomposition, after all.

Nice post