I have a set of 2D points that I would like to partition into the correct number of clusters using the K-means algorithm.
I read that for a fixed number of clusters, I should run K-means several times and keep the result that gives the minimum variance.
For example, say I know the "correct" number of clusters is 4. The pseudo-code for this example:
    int numIterations = 10;
    var results = new List<KMeansResult>();
    for (int i = 0; i < numIterations; ++i)
    {
        // Each run starts from a different random initialization,
        // so it can converge to a different local optimum.
        results.Add(kmeans.Compute(4));
    }
This gives me 10 different partitionings of the data into 4 clusters in results, each with its own per-cluster variances.
My question is how to quantify "minimum" variance in this case. Since the variance is two-dimensional, i.e. there is a var(X) and a var(Y), there may be runs where var(X) is minimized but var(Y) is not. What would be a good measure to lump the two together?
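One common way to lump the two dimensions together is to score each run by its total within-cluster scatter: the sum, over all points, of the squared Euclidean distance to the assigned cluster's centroid, which per cluster is proportional to var(X) + var(Y). Below is a minimal sketch of that score; the function and its array-based inputs are my own illustration, not part of any particular K-means library:

    // Total within-cluster scatter of one K-means run: the sum of squared
    // Euclidean distances from each 2D point to its assigned centroid.
    // points[i] is a point, labels[i] its cluster index, and centroids[k]
    // the center of cluster k. (Hypothetical helper, not a library call.)
    static double WithinClusterScatter(double[][] points, int[] labels, double[][] centroids)
    {
        double total = 0.0;
        for (int i = 0; i < points.Length; ++i)
        {
            double dx = points[i][0] - centroids[labels[i]][0];
            double dy = points[i][1] - centroids[labels[i]][1];
            total += dx * dx + dy * dy; // X and Y spread lumped into one number
        }
        return total;
    }

The "best" of the 10 runs would then be the one with the smallest score, e.g. a plain loop over results that tracks the minimum.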
Comments:

- That assumes covar(X,Y) = 0, which might be true for the entire sample set, but most likely is not true per cluster. – amit
- Regarding var(X+Y): you need to account for the covariance of the two random variables, since var(X+Y) = var(X) + var(Y) + 2cov(X,Y). So if you want the joint variance, you need to take care of the covariance. – amit
- I'm not sure var(X+Y) would help here. I was going for var(X) + var(Y), which is basically var(||data point - cluster center||). – Nico Schertler
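To make amit's identity concrete, here is a small self-contained check (my own illustration, not from the thread) that var(X+Y) differs from var(X) + var(Y) by twice the covariance, and that var(X) + var(Y) equals the mean squared distance of the points to their mean, the per-cluster quantity the last comment is getting at:

    using System;

    class VarianceCheck
    {
        static void Main()
        {
            // A small, arbitrary set of 2D points (illustrative data only).
            double[] xs = { 1.0, 2.0, 4.0, 7.0, 9.0 };
            double[] ys = { 3.0, 1.0, 6.0, 8.0, 2.0 };
            int n = xs.Length;

            double meanX = 0, meanY = 0;
            for (int i = 0; i < n; ++i) { meanX += xs[i]; meanY += ys[i]; }
            meanX /= n; meanY /= n;

            // Population variances, covariance, and the two "lumped" measures.
            double varX = 0, varY = 0, cov = 0, varSum = 0, meanSqDist = 0;
            for (int i = 0; i < n; ++i)
            {
                double dx = xs[i] - meanX, dy = ys[i] - meanY;
                varX += dx * dx;
                varY += dy * dy;
                cov += dx * dy;
                varSum += (dx + dy) * (dx + dy); // contributes to var(X+Y)
                meanSqDist += dx * dx + dy * dy; // squared distance to the mean
            }
            varX /= n; varY /= n; cov /= n; varSum /= n; meanSqDist /= n;

            Console.WriteLine($"var(X+Y)                = {varSum}");
            Console.WriteLine($"var(X)+var(Y)+2cov(X,Y) = {varX + varY + 2 * cov}");
            Console.WriteLine($"var(X)+var(Y)           = {varX + varY}");
            Console.WriteLine($"mean ||p - mean||^2     = {meanSqDist}");
        }
    }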