Combining PCA and Kmeans
Although widely used in practice, Kmeans has several drawbacks. The number of clusters has to be defined in advance, and the result depends on the initial centroid locations. More details on how to handle these issues can be found on Data Mining Research (search for clustering in the upper bar).
A weakness common to clustering in general concerns the visualization of the obtained clusters. A possible solution is to preprocess the data using PCA (1). First, PCA is applied to the data. Using the principal components, the data is mapped into the new feature space. Then the Kmeans algorithm is applied to the data in that feature space. The objective is to make the different clusters easier to distinguish when they are plotted. The following picture shows the difference between plotting the data using two randomly chosen parameters and using the first two principal components.
(1) I.T. Jolliffe. Principal Component Analysis. Springer, 2002.
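The procedure described above can be sketched roughly as follows. This is only a minimal illustration using NumPy, not the code behind the post: the data set, the helper names `pca_scores` and `kmeans`, and the choice of three clusters are all assumptions made for the example. Clustering is done on all the principal components and only the first two are kept for plotting.

```python
import numpy as np

def pca_scores(X):
    """Return the PC scores (all components) and the variance of each component."""
    Xc = X - X.mean(axis=0)                      # center each variable
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T                           # data expressed in PC coordinates
    variances = s**2 / (len(X) - 1)              # eigenvalues of the covariance matrix
    return scores, variances

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; the result depends on the random starting centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distance from every point to every centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical data: three blobs in five dimensions.
rng = np.random.default_rng(42)
centers = rng.normal(scale=10.0, size=(3, 5))
X = np.vstack([c + rng.normal(size=(50, 5)) for c in centers])

scores, variances = pca_scores(X)
labels, _ = kmeans(scores, k=3)   # cluster using all the PCs
xy = scores[:, :2]                # only these two columns would be plotted
```

Coloring the points in `xy` by `labels` would give the kind of two-dimensional view the post refers to.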
Comments
7 Comments on Combining PCA and Kmeans

John Aitchison on
Mon, 2nd Apr 2007 9:40 am

Sandro Saitta on
Mon, 2nd Apr 2007 11:27 am

nie on
Tue, 3rd Apr 2007 4:47 am

John Aitchison on
Sun, 8th Apr 2007 11:03 pm

Sandro Saitta on
Thu, 10th May 2007 1:13 pm

Anonymous on
Thu, 26th Jul 2007 4:29 pm

Wrozba Na Dzis on
Fri, 10th Sep 2010 8:37 am
Hi Sandro,
I am not sure that I agree with this use of PCA and kmeans.
If you already have a cluster solution, you can visualize it in two-dimensional space using, for instance, multidimensional scaling.
Or you could plot the data points (colored according to groups) on the first two linear discriminant functions.
If you first transform your data to just the first two principal components (before clustering), you are emphasizing the variables with the greatest variance. However, these might not be the variables that contribute most to “cluster separation”.
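A small synthetic case illustrates this concern. The data below are made up for the purpose: one variable has a large variance but no group structure, the other has a small variance and carries all of the separation between two known groups. The `separation` helper is a hypothetical measure (difference in group means divided by the pooled standard deviation), not a standard library function.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
group = np.repeat([0, 1], n)                       # known group membership
x = rng.normal(scale=10.0, size=2 * n)             # high variance, no group structure
y = np.where(group == 0, -1.0, 1.0) + rng.normal(scale=0.2, size=2 * n)
X = np.column_stack([x, y])

# PCA via SVD of the centered data
Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T
variances = s**2 / (len(X) - 1)

def separation(col):
    """Difference between group means in units of the pooled standard deviation."""
    a, b = col[group == 0], col[group == 1]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

sep = np.array([separation(scores[:, j]) for j in range(2)])
print("variance per PC:  ", variances)
print("separation per PC:", sep)
```

In this constructed example the first PC holds almost all of the variance while the second PC holds almost all of the group separation, so keeping only the first PC before clustering would discard the useful direction.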
Hi John,
Thanks for the comment. I realize that my post was not clear enough. What I meant is that I use all the PCs when doing the clustering, and then I plot only the first two PCs. I hope this is clearer.
Hi Sandro,
I am a fourth-year undergraduate, and my final project is about clustering speech signals with Kmeans in order to identify their phonemes.
I am still confused about how to assign labels to the clusters that result from the clustering. Could you suggest a solution?
Thanks for your attention.
Sandro, I still don’t think I agree with your approach. When using ALL the PCs you are effectively using all the (linear) information in the original variable set. The usual reason for PCA is to reduce the dimensionality of the data by dropping some of the (later) PCs.
Then there is the question of variances. If you use the eigenvalues, i.e. do not rescale the components to a constant variance, then the first PC will dominate the Kmeans solution. If you do rescale them all to a common unit variance, then factors (PCs) that have less variance in the original data will have the same influence as all the other factors. This may not be what you want.
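The two options can be compared numerically. The sketch below uses made-up correlated data and relies on one fact: the average squared distance between two points along a coordinate is twice that coordinate's variance, so each PC's share of the total variance is also its share of the squared Euclidean distances that Kmeans works with. The mixing matrix `A` is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
A = np.array([[5.0, 1.0, 0.0],
              [0.0, 2.0, 0.5],
              [0.0, 0.0, 0.5]])                  # arbitrary mixing matrix (assumption)
X = rng.normal(size=(300, 3)) @ A

Xc = X - X.mean(axis=0)
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T
variances = s**2 / (len(X) - 1)                  # eigenvalues

# Option 1: keep the eigenvalue scaling; the first PC dominates the distances.
share_raw = variances / variances.sum()

# Option 2: rescale every PC to unit variance; all PCs weigh the same.
whitened = scores / np.sqrt(variances)
white_var = whitened.var(axis=0, ddof=1)
share_white = white_var / white_var.sum()

print("share of squared distance, unscaled PCs:", share_raw)
print("share of squared distance, rescaled PCs:", share_white)
```

Which of the two is appropriate depends on whether the low-variance directions are thought to carry signal or mostly noise.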
Finally, I suspect that using PCA will smooth out any “interestingness” in your original data. By the Central Limit Theorem, a weighted combination of non-Gaussian variables will tend to be more Gaussian, and that is not “interesting”. Suppose some of your original variables were bimodal. That is “interesting” and will influence a Kmeans solution. Adding these bimodal variables together to form a “factor” (PC) will make that “interestingness” disappear.
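The effect can be illustrated under a strong assumption: the bimodal variables are independent and are combined with equal weights. Real PCs are weighted combinations of correlated variables, so the effect may be weaker there, and variables that share the same two modes would stay bimodal when summed. Excess kurtosis is used here as a rough indicator (0 for a Gaussian, strongly negative for a symmetric bimodal variable).

```python
import numpy as np

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (0 for a Gaussian)."""
    z = (v - v.mean()) / v.std()
    return (z**4).mean() - 3.0

rng = np.random.default_rng(1)
n_samples, n_vars = 20000, 10

# Each variable is bimodal: a mode at -1 or +1 plus a little Gaussian noise.
modes = rng.choice([-1.0, 1.0], size=(n_samples, n_vars))
V = modes + rng.normal(scale=0.3, size=(n_samples, n_vars))

combined = V.sum(axis=1)          # equal-weight combination of all variables

k_single = excess_kurtosis(V[:, 0])
k_combined = excess_kurtosis(combined)
print("excess kurtosis, one bimodal variable:", k_single)
print("excess kurtosis, combination of 10:   ", k_combined)
```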
nie,
I don’t quite know what you mean when you say “labels”. You can just call them cluster 1, cluster 2, etc. But if you examine the means on each variable you will find differences across clusters, so cluster 1 might have higher means than all other clusters on variables 17 and 23. This might then allow you to give a “meaningful name” to this cluster, e.g. “near fricatives”.
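Comparing cluster means variable by variable might look like the sketch below. The data and the `labels` array are invented stand-ins for a real Kmeans result, and the standardized `profile` table is just one possible way to make the means comparable across variables.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))        # hypothetical data: 100 points, 4 variables
labels = np.repeat([0, 1], 50)       # stand-in for the cluster labels from Kmeans
X[labels == 1, 2] += 4.0             # make cluster 1 high on variable 2

overall_mean = X.mean(axis=0)
overall_sd = X.std(axis=0, ddof=1)

# One row per cluster: how far its mean lies from the overall mean,
# in units of each variable's overall standard deviation.
profile = np.array([
    (X[labels == c].mean(axis=0) - overall_mean) / overall_sd
    for c in np.unique(labels)
])

# The variable on which each cluster deviates most from the overall mean.
top_variable = np.abs(profile).argmax(axis=1)
print(profile.round(2))
print("most distinctive variable per cluster:", top_variable)
```

The variables that stand out in a cluster's row are the ones to look at when choosing a descriptive name for it.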
Hi John,
Sorry for the late answer. I’ve had other things to do, but I’m still working on the issue of combining PCA and Kmeans. I have run some tests which, in my case, show that applying PCA before Kmeans does not change the clustering solution. However, this may be specific to my data set and I still have to think about it. Thanks for your help.
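One likely explanation for this observation is that keeping all the PCs only rotates (and possibly reflects) the centered data, which leaves every Euclidean distance unchanged, and Kmeans depends on the data only through those distances. The check below uses made-up data; it also shows that the distances do change once only the first two PCs are kept.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 4))   # correlated hypothetical data

Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T            # all 4 PCs: an orthogonal transform of the centered data

def pairwise_sq_dists(A):
    """Squared Euclidean distance between every pair of rows."""
    diff = A[:, None, :] - A[None, :, :]
    return (diff**2).sum(axis=2)

d_orig = pairwise_sq_dists(X)
d_all = pairwise_sq_dists(scores)          # all PCs kept
d_two = pairwise_sq_dists(scores[:, :2])   # only the first two PCs kept

same_with_all_pcs = np.allclose(d_orig, d_all)
same_with_two_pcs = np.allclose(d_orig, d_two)
print("distances preserved with all PCs:", same_with_all_pcs)
print("distances preserved with two PCs:", same_with_two_pcs)
```

With identical distances and the same starting points, Kmeans should produce the same partition in both spaces, apart from floating-point ties.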
John and Sandro –
The crux of PCA is the creation of orthogonal dimensions, which is why it gives a more palatable visual solution. No inaccuracy is introduced by clustering in either the physical space or the PCA space. Kmeans is also not guaranteed to reach a global minimum of the within-cluster variance, so the Kmeans solution applied directly may not be the most “efficient” solution in that sense.
In fact, there is an article in the literature suggesting PCA as the continuous solution to Kmeans clustering; it suggests both using PCA as a basis for clustering OR using a method that approximates connectivity. Further, it is also established that the subspace spanned by the PCs is identical to the subspace defined by the cluster centroids.
In truth, PCA (given its close relationship to SVD) can have significant advantages over LDA, where you have to make particular distributional assumptions that are quite strict. It’s just a matrix decomposition, after all.
Nice post