4 votes

I have implemented k-means clustering to determine the clusters in 300 objects. Each of my objects has about 30 dimensions. The distance is calculated using the Euclidean metric.

I need to know

  1. How would I determine if my algorithm works correctly? I can't produce a graph that would give some idea of the correctness of my algorithm.
  2. Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
The appropriate distance measure depends on the type/source of the data, but Euclidean distance is a good default. – Fred Foo
@Larsmans, can you point to some examples that back that up in >= 30 dimensions? Not that I disagree, but as you say, it depends; people in image classification, text classification ... seem to work hard on problem-specific metrics. – denis
@Denis: check out the scikit-learn document clustering example, which uses Euclidean distances on 10,000s of features. Also, note that the Euclidean norm is used in all tf-idf-based search engines, which can be thought of as ad hoc clustering algorithms. – Fred Foo
@Larsmans, nice example, thanks. But after X = Normalizer(norm="l2"), "the dot product of two l2-normalized TF-IDF vectors is the cosine similarity ... for the Vector Space Model commonly used by the Information Retrieval community" -- Normalizer doc. So we're far from Euclidean vs L1 ... on unknown data, which seems to be the OP's question. Any more examples? – denis
@Denis: well, I'm not saying that L1 is necessarily a bad distance metric for clustering. But Euclidean seems to be more common and a reasonable default. – Fred Foo

4 Answers

11 votes

The two questions in the OP are separate topics (i.e., no overlap in the answers), so I'll try to answer them one at a time, starting with item 1 on the list.

How would I determine if my [clustering] algorithms works correctly?

k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions like "are the cluster assignments returned by k-means more meaningful for k=3 or k=5?"

Still, there is one widely accepted test that yields intuitive results and that is straightforward to apply. This diagnostic metric is just this ratio:

inter-centroidal separation / intra-cluster variance

As the value of this ratio increases, the quality of your clustering result increases.

This is intuitive. The first of these metrics just asks: how far apart is each cluster from the others (measured by the cluster centers)?

But inter-centroidal separation alone doesn't tell the whole story, because two clustering algorithms could return results having the same inter-centroidal separation though one is clearly better, because the clusters are "tighter" (i.e., smaller radii); in other words, the cluster edges have more separation. The second metric--intra-cluster variance--accounts for this. This is just the mean variance, calculated per cluster.

In sum, the ratio of inter-centroidal separation to intra-cluster variance is a quick, consistent, and reliable technique for comparing results from different clustering algorithms, or from the same algorithm run with different parameter settings--e.g., number of iterations, choice of distance metric, number of centroids (value of k).

The desired result is tight (small) clusters, each one far away from the others.

The calculation is simple:

For inter-centroidal separation:

  • calculate the pair-wise distance between cluster centers; then

  • calculate the median of those distances.

For intra-cluster variance:

  • for each cluster, calculate the distance of every data point in a given cluster from its cluster center; next

  • (for each cluster) calculate the variance of the sequence of distances from the step above; then

  • average these variance values.
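
A minimal sketch of this calculation in Python (numpy and scipy are assumed; the function name and array layout are my own choices for illustration, not something from the original post):

```python
import numpy as np
from scipy.spatial.distance import pdist

def separation_to_variance_ratio(X, labels, centers):
    """Inter-centroidal separation / intra-cluster variance.

    X       -- (n_samples, n_features) data matrix
    labels  -- (n_samples,) integer cluster index of each point (numpy array)
    centers -- (k, n_features) cluster centers
    """
    # Inter-centroidal separation: median of the pairwise distances between centers.
    separation = np.median(pdist(centers))

    # Intra-cluster variance: per cluster, the variance of the point-to-center
    # distances; then the mean of those variances.
    variances = []
    for j, center in enumerate(centers):
        dists = np.linalg.norm(X[labels == j] - center, axis=1)
        variances.append(dists.var())

    return separation / np.mean(variances)
```

Higher values indicate tight clusters that sit far apart; comparing the value for, say, k=3 versus k=5 (or for different distance metrics) gives a rough way to rank the alternatives.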


That's my answer to the first question. Here's the second question:

Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?

First, the easy question--is Euclidean distance a valid metric as dimensions/features increase?

Euclidean distance is perfectly scalable--works for two dimensions or two thousand. For any pair of data points:

  • subtract their feature vectors element-wise,

  • square each item in that result vector,

  • sum that result,

  • take the square root of that scalar.

Nowhere in this sequence of calculations is scale (the number of dimensions) implicated.
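
In code, those steps are just a few lines; a minimal numpy sketch (the function and variable names are illustrative, not from the original answer):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two feature vectors of any length."""
    diff = a - b                       # subtract element-wise
    return np.sqrt(np.sum(diff ** 2))  # square each term, sum, take the square root

# The same function works unchanged for 30 or 100 dimensions:
x30, y30 = np.random.rand(30), np.random.rand(30)
x100, y100 = np.random.rand(100), np.random.rand(100)
print(euclidean(x30, y30), euclidean(x100, y100))
```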

But whether Euclidean distance is the appropriate similarity metric for your problem depends on your data. For instance, is it purely numeric (continuous)? Or does it have discrete (categorical) variables as well (e.g., gender: M/F)? If one of your dimensions is "current location" and, of the 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you can't really say that, on average, your users are from somewhere in Kansas, but that's sort of what Euclidean distance would do.

In any event, since we don't know anything about your data, I'll just give you a simple flow diagram so that you can apply it to your data and identify an appropriate similarity metric.

To identify an appropriate similarity metric given your data:

(flow diagram from the original answer; the image is not reproduced here)

1 vote
  1. Euclidean distance is good when the dimensions are comparable and on the same scale. If one dimension represents length and another the weight of the item, plain Euclidean distance should be replaced with a weighted distance.

  2. Reduce the data to 2D and plot it - this is a good way to see visually whether the clustering works. Or use a sanity check: find the cluster centers and verify that no item in a cluster is too far away from its center. (A sketch of both ideas follows below.)
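
Here is a minimal sketch of both ideas, assuming scikit-learn and matplotlib are available (the z-score scaling, the PCA projection, and k=4 are illustrative choices, not something prescribed above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X = np.random.rand(300, 30)                     # stand-in for your 300 x 30 data

# 1. Put all dimensions on the same scale so no single unit dominates.
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_scaled)

# 2a. Project to 2-D and eyeball the clusters.
X_2d = PCA(n_components=2).fit_transform(X_scaled)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=km.labels_, s=10)
plt.title("k-means clusters after PCA projection to 2-D")
plt.show()

# 2b. Sanity check: distance of every point to its own cluster center.
d = np.linalg.norm(X_scaled - km.cluster_centers_[km.labels_], axis=1)
print("median / max distance to own center:", np.median(d), "/", d.max())
```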

1 vote

Can't you just try sum |xi - yi| (L1) instead of sqrt(sum (xi - yi)^2) (Euclidean) in your code, and see if it makes much difference?
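
For example, a quick sketch of that check (scipy's cdist and the variable names are my own; it assigns each point to its nearest center under both metrics and counts how many assignments differ):

```python
import numpy as np
from scipy.spatial.distance import cdist

X = np.random.rand(300, 30)                               # your data
centers = X[np.random.choice(len(X), 4, replace=False)]   # some cluster centers

labels_l2 = cdist(X, centers, metric="euclidean").argmin(axis=1)
labels_l1 = cdist(X, centers, metric="cityblock").argmin(axis=1)

print("points assigned differently under L1 vs L2:", np.sum(labels_l2 != labels_l1))
```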

I can't have a graph which will give some idea about the correctness of my algorithm.

A couple of possibilities:

By the way, scipy.spatial.cKDTree can easily give you, say, the 3 nearest neighbors of each point, in p=2 (Euclidean) or p=1 (Manhattan, L1), to look at. It's fast up to ~ 20 dimensions, and with early cutoff it works even in 128 dimensions.
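
A minimal sketch of that cKDTree usage (k=4 is used so that each point's own zero-distance match is included alongside its 3 neighbors; the random data is a stand-in):

```python
import numpy as np
from scipy.spatial import cKDTree

X = np.random.rand(300, 30)     # stand-in for your 300 x 30 data
tree = cKDTree(X)

# 3 nearest neighbors of every point under Euclidean (p=2) and Manhattan (p=1).
dist_l2, idx_l2 = tree.query(X, k=4, p=2)   # column 0 is the point itself
dist_l1, idx_l1 = tree.query(X, k=4, p=1)

print(dist_l2[:3])
print(idx_l2[:3])
```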


See also: Cosine distance, and the question "euclidean-distance-is-usually-not-good-for-sparse-data".
1 vote

Euclidean distance is the intuitive and "normal" distance between continuous variables. It can be inappropriate if the data are too noisy or have a non-Gaussian distribution.

You might want to try the Manhattan distance (or cityblock), which is robust to that (bear in mind that robustness always comes at a cost: in this case, a bit of the information is lost).

There are many further distance metrics for specific problems (for example, the Bray-Curtis distance for count data). You might want to try some of the distances implemented in pdist from the Python module scipy.spatial.distance.
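
For instance, a small sketch comparing a few of those metrics on the same data (the choice of metrics and the random stand-in data are illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.random.rand(300, 30)    # stand-in for your data (non-negative, as Bray-Curtis expects)

for metric in ("euclidean", "cityblock", "braycurtis"):
    d = pdist(X, metric=metric)                  # condensed vector of all pairwise distances
    print(metric, "median pairwise distance:", np.median(d))
```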