
I am performing a cluster analysis on 86 different variables, which I managed to reduce to 19 PCs using PCA. Using scikit-learn's K-means clustering algorithm, I got 10 clusters. However, I can't figure out which variables are responsible for separating these clusters. How do I determine which variables are responsible for a certain cluster?


1 Answer


PCA creates principal components, which are essentially linear combinations of the underlying features, to reduce the dimensionality from, in your case, 86 features to the 19 principal components that capture the most variance.
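As a concrete sketch of that reduction, here is what the scikit-learn side might look like; `X` is a placeholder for your own (n_samples, 86) feature matrix, and the random data is only there to make the example runnable:

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder for your real data: 200 samples, 86 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 86))

# Reduce 86 features to the 19 components with the most variance
pca = PCA(n_components=19)
X_19 = pca.fit_transform(X)

print(X_19.shape)                           # (200, 19)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

`explained_variance_ratio_` is worth checking: it tells you how much of the original variance your 19 components actually preserve.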

In order to understand what discriminative features these principal components are based on, you'd have to dive into what PCA does under the hood. Simply put, PCA performs an eigendecomposition of the correlation (or covariance) matrix of the 86 features, then projects the data onto a new vector space spanned by the 19 eigenvectors with the largest eigenvalues.
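That under-the-hood procedure can be reproduced by hand with NumPy. A minimal sketch, again assuming `X` is a standardized (n_samples, 86) array (the random data is just a stand-in):

```python
import numpy as np

# Stand-in for your standardized data matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 86))

# 86 x 86 correlation matrix of the features
corr = np.corrcoef(X, rowvar=False)

# eigh is appropriate here because the matrix is symmetric;
# it returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(corr)

# Sort eigenpairs by descending eigenvalue and keep the top 19
order = np.argsort(eigvals)[::-1]
top_vals = eigvals[order[:19]]
top_vecs = eigvecs[:, order[:19]]  # columns are the component directions

# Project the data onto the 19-dimensional subspace
X_reduced = X @ top_vecs
print(X_reduced.shape)  # (200, 19)
```

This is (up to centering/scaling conventions and sign flips of the eigenvectors) the same projection scikit-learn's `PCA` computes for you.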

In order to get a rough estimate of which features PCA deems "principal", you can do this eigendecomposition manually and inspect the eigenvectors (the loadings) associated with the largest eigenvalues: the features with the largest absolute weights in those eigenvectors contribute most to the corresponding components. However, keep in mind that this won't be a 1-to-1 mapping, since each principal component is a linear combination of all 86 features. Even so, the loadings can help you understand the data better.
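If you kept the fitted `PCA` object, you don't even need to redo the decomposition: its `components_` attribute already holds the loadings. A sketch of ranking features per component (random placeholder data, as above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data; substitute your real feature matrix
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 86))

pca = PCA(n_components=19).fit(X)

# pca.components_ has shape (19, 86): row i holds component i's
# loadings, one weight per original feature
for i, component in enumerate(pca.components_[:3]):
    # Rank original features by absolute loading on this component
    top_features = np.argsort(np.abs(component))[::-1][:5]
    print(f"PC{i + 1}: top feature indices {top_features.tolist()}")
```

Mapping those indices back to your column names then tells you which original variables dominate each component, and, combined with which components separate your K-means clusters, which variables drive the separation.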

Also, here is a great explanation of PCA and how it relates to eigendecomposition if you're interested: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues