
I have a set of 2D points, where I would like to use the K-means algorithm to partition out the correct number of clusters.

I read that for a fixed number of clusters, I should run the algorithm a few times and keep the result that gives the minimum variance.

For example, say I know that the "correct" number of clusters is 4. The pseudo-code for this example:

List<KMeansResult> result = new List<KMeansResult>();

for (int i = 0; i < numIteration; ++i)
{
    result.Add(kmeans.Compute(4));
}

With numIteration = 10, I will get 10 different ways of obtaining 4 clusters in result, each with its own per-cluster variances.
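The loop above can be fleshed out into something runnable; here is a minimal NumPy sketch (the `kmeans` helper, the synthetic data, and the restart count are all made up for illustration), where each run reports its within-cluster sum of squared distances and the "best" run is the one minimizing it:

```python
import numpy as np

def kmeans(points, k, seed, n_iter=50):
    """Plain Lloyd's algorithm: returns (centroids, labels, wcss)."""
    rng = np.random.default_rng(seed)
    # initialize centroids from k distinct random points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # squared distance of every point to every centroid
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep old centroid if a cluster empties
                centroids[j] = points[labels == j].mean(axis=0)
    # final assignment and within-cluster sum of squared distances
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    wcss = d2[np.arange(len(points)), labels].sum()
    return centroids, labels, wcss

# four well-separated synthetic clusters (placeholder data)
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
                    for loc in ([0, 0], [3, 0], [0, 3], [3, 3])])

# run k-means several times and keep the run with the
# smallest within-cluster sum of squared distances
results = [kmeans(points, 4, seed) for seed in range(10)]
best = min(results, key=lambda r: r[2])
```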

My question in this case is how to quantify "minimum" variance. Since the variance is in 2 dimensions, i.e. var(X) and var(Y), there may be runs where var(X) is minimized but var(Y) is not. What would be a good measure to "lump" the two together?

Just sum them up. This will then be the variance of the distance to the cluster centroid. – Nico Schertler
@NicoSchertler You are assuming cov(X, Y) = 0, which might be true for the entire sample set, but most likely is not true per cluster. – amit
@amit: I am not sure why the covariance should be accounted for when calculating an overall cluster variance, unless you're looking for elliptical representations. Your answer is basically what I suggested, isn't it? – Nico Schertler
@NicoSchertler "My" algorithm (not mine; I've seen it around a bunch of times, including in the Machine Learning course on Coursera) sums distances. If you want the variance var(X+Y), you need to account for the covariance of the two random variables: var(X+Y) = var(X) + var(Y) + 2cov(X, Y). So if you want the joint variance, you need to take care of the covariance. – amit
@amit Agreed. But I don't see how var(X+Y) would help here. I was going for var(X) + var(Y), which is basically var(||data point - cluster center||). – Nico Schertler
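For what it's worth, the identity behind the first comment can be checked numerically (a NumPy sketch with made-up data): summing the per-axis variances of a cluster gives the mean squared distance of its points to the centroid, with no covariance term involved.

```python
import numpy as np

# one hypothetical cluster of 2D points
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))
centroid = points.mean(axis=0)

# sum of the per-axis variances...
per_axis = points[:, 0].var() + points[:, 1].var()

# ...equals the mean squared Euclidean distance to the centroid
mean_sq_dist = np.mean(((points - centroid) ** 2).sum(axis=1))

print(np.isclose(per_axis, mean_sq_dist))  # True
```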

1 Answer


A common practice is to try to minimize sum_{i=1,...,n} (C_i - x_i) \cdot (C_i - x_i), where:

  • C_i - the centroid of the cluster that point i is assigned to
  • x_i - the point i
  • \cdot - dot product
  • n - number of samples

The idea is to minimize the squared L2 distance from each point to its cluster centroid, resulting in each cluster being packed tighter together.

In simple words, it basically sums, for each point, the Euclidean distance to its centroid, but without taking the square root in the Euclidean distance metric (i.e. the squared Euclidean distance).
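This objective is a single scalar per run, so it gives you exactly the "lumped" measure to compare restarts by. A NumPy sketch (the points, labels, and centroids are made-up placeholders standing in for one k-means run's output):

```python
import numpy as np

# hypothetical result of one k-means run
points = np.array([[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9]])
labels = np.array([0, 0, 1, 1])                 # cluster index of each point
centroids = np.array([[0.1, 0.05], [3.05, 2.95]])

diffs = points - centroids[labels]              # C_i - x_i for every point i
objective = np.einsum('ij,ij->', diffs, diffs)  # sum of dot products
# objective is the within-cluster sum of squared distances;
# compare it across restarts and keep the smallest
```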