
I have 150 images, 15 each of 10 different people, so I already know which images should belong together if clustered.

Each image is represented by a 73-dimensional feature vector, and I clustered the images into 10 clusters using the kmeans function in MATLAB.

Later, I processed these 150 data points, reducing their dimensionality from 73 to 3 for my work, and applied the same kmeans function to them.

I want to compare the results obtained on the two data sets (processed and unprocessed) with the same k-means function, to find out whether the dimensionality reduction improves the clustering or not.

I thought the variance of each cluster could be one parameter for comparison, but I am not sure I can directly compare metrics such as the within-cluster sum of distances, since the two cases have different dimensionality. Could anyone suggest a way to compare the k-means results, some way to normalize them, or any other comparison I can make?
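For reference, a minimal sketch of what I am doing (variable names are placeholders: `X73` is my 150×73 feature matrix and `X3` is the reduced version):

```matlab
% Hypothetical names: X73 is the 150x73 feature matrix, X3 is the
% same 150 points after reduction to 3 dimensions.
k = 10;
[idx73, C73, sumd73] = kmeans(X73, k, 'Replicates', 20);
[idx3,  C3,  sumd3 ] = kmeans(X3,  k, 'Replicates', 20);
% sumd73 and sumd3 are within-cluster sums of point-to-centroid
% distances, but they are measured in different spaces (73-D vs 3-D),
% so the raw numbers cannot be compared directly.
```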

Have you considered silhouettes? – Dan
@Dan Thank you for the reference; I am new to this and had not considered silhouettes yet. As I understand it, comparing the silhouette values in both cases (the same data points in the 'unprocessed' 73 dimensions and the 'processed' 3 dimensions) should not require any other form of normalization, is that the case? – Yeshi
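A minimal sketch of that comparison, assuming `X73`/`X3` hold the same 150 points in the two spaces and `idx73`/`idx3` are the corresponding `kmeans` assignments from the sketch above:

```matlab
% Silhouette values are bounded in [-1, 1] whatever the dimensionality,
% so the two means are directly comparable without extra normalization.
s73 = silhouette(X73, idx73);   % one value per point, 73-D clustering
s3  = silhouette(X3,  idx3);    % one value per point, 3-D clustering
fprintf('mean silhouette: 73-D = %.3f, 3-D = %.3f\n', mean(s73), mean(s3));
```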
Since k-means acts as a classifier here, you can compare the two classifiers (in your case, k-means with 73 dimensions vs. k-means with 3 dimensions) by looking at how many images were correctly classified by each. Also, since k-means assigns the label of the closest cluster, you can get an idea of how robust the model is by comparing the distance to the closest cluster with the distance to the second-closest cluster. A "big" difference between these distances translates to good robustness against noise (a low probability of misclassification due to noise). – Sembei Norimaki
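A rough sketch of both of those suggestions, assuming the true person identities are stored in a 150×1 vector `labels` (an assumed variable; k-means never sees it) and reusing the names from the earlier sketch. Since k-means numbers its clusters arbitrarily, each cluster is first mapped to its majority true label:

```matlab
% Classification accuracy: map each cluster to its majority true label
% (cluster numbering is arbitrary), then count correct assignments.
mapped = zeros(size(idx73));
for c = 1:10
    mapped(idx73 == c) = mode(labels(idx73 == c));
end
acc73 = mean(mapped == labels);

% Robustness: gap between the closest and second-closest centroid.
D = pdist2(X73, C73);           % 150x10 point-to-centroid distances
Ds = sort(D, 2);
margin73 = Ds(:,2) - Ds(:,1);   % larger gap = more robust assignment
fprintf('accuracy = %.3f, median margin = %.3f\n', acc73, median(margin73));
```

Repeating the same computation with `idx3`, `X3`, and `C3` gives the figures for the 3-D run.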

1 Answer


I can think of three options. I am unaware of any well-developed methodology for doing this specifically with k-means clustering.

  1. Look at the confusion matrix between the two approaches (sketched below).
  2. Compare the Mahalanobis distances between the clusters, and from items in each cluster to their nearest other clusters (also sketched after this list).
  3. Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
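A minimal sketch of option 1, assuming the two assignment vectors `idx73` and `idx3` from your runs. Because each run numbers its clusters arbitrarily, the matrix shows co-occurrence rather than a fixed label correspondence:

```matlab
% Entry (i,j) counts points placed in cluster i by the 73-D run and
% cluster j by the 3-D run; a near-permutation matrix means the two
% clusterings largely agree.
CM = confusionmat(idx73, idx3);
disp(CM);
agreement = sum(max(CM, [], 2)) / numel(idx73);  % greedy row-wise match
fprintf('approximate agreement: %.3f\n', agreement);
```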
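And a sketch of option 2 using `mahal`, which returns the squared Mahalanobis distance of each point to a reference sample. One caveat: `mahal` needs more reference points than dimensions, so with only 15 images per cluster this works in the 3-D space but not in the 73-D one:

```matlab
% Squared Mahalanobis distance of every point to each cluster's sample.
n = size(X3, 1);
k = 10;
D = zeros(n, k);
for c = 1:k
    D(:, c) = mahal(X3, X3(idx3 == c, :));
end
own = D(sub2ind(size(D), (1:n)', idx3));   % distance to own cluster
D(sub2ind(size(D), (1:n)', idx3)) = inf;   % mask own cluster
nearestOther = min(D, [], 2);              % distance to nearest other cluster
separation = nearestOther ./ own;          % >> 1 means well-separated points
```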

The problem with 3 is that the distance metrics get skewed: 3-D and 73-D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on k-means if you are set on that path; rank speculation is fun, but standing on the shoulders of giants is better.