
I would like to find an algorithm that circumvents some drawbacks of k-means:

Given:

x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

matrix <- cbind(x, y)                 # defining the data matrix
Kmeans <- kmeans(matrix, centers=2)   # with 2 centroids

plot(x, y, col=Kmeans$cluster, pch=19, cex=2)
points(Kmeans$centers, col=1:2, pch=3, cex=3, lwd=3)

Here I would like an algorithm that clusters the data into two groups divided by a diagonal running from the lower-left corner to the upper-right corner.

This wouldn't minimize the within-group inertia, would it? So I think you're looking for something that is not clustering. Have you thought about fitting two lines? Or fitting a Gaussian mixture? The real question is: why do you think your groups should look like this? – iago-lito
Hmm, it seems like a natural choice. I mean, obviously there is some linearity. Still, I hoped to find a clustering algorithm that would approximate my intuition about the two clusters. – Googme
I think a mixture of Gaussians might work well in this case. The algorithm will automatically choose how many clusters you need (here 2) and their correlation structure (here 45°-stretched). I don't know yet which package you'd need, but it shouldn't be too hard to find. – iago-lito

2 Answers

1 vote

Try Mclust from the mclust package; it will fit a Gaussian mixture to your data. The default behaviour:

library(mclust)

mc <- Mclust(matrix)
points(t(mc$parameters$mean))
plot(mc)

... will find 4 groups, but you might be able to force it to 2, or to force the correlation structure of the Gaussians to be stretched in the right direction.

Be aware that it will be hard to interpret and justify the meaning of your groups unless you understand very well why you want exactly two of them.
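If you do want exactly two groups, Mclust takes a `G` argument, and `modelNames` restricts the covariance structure of the components. A minimal sketch of that idea, assuming the mclust package is installed (the data is the matrix from the question):

```r
library(mclust)

x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
matrix <- cbind(x, y)

# G = 2 forces two components; "VVV" lets each component have its own
# freely oriented (e.g. 45-degree-stretched) covariance matrix.
mc2 <- Mclust(matrix, G = 2, modelNames = "VVV")

plot(x, y, col = mc2$classification, pch = 19, cex = 2)
points(t(mc2$parameters$mean), col = 1:2, pch = 3, cex = 3, lwd = 3)
```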

0 votes

What you are asking for can be solved in multiple ways. Here are two:

  1. The first way is to simply define the separating line of your clusters. Since you know how your points should be grouped (by a line), you can use that directly.

If you want your line to start at the origin, then simply check if x > y:

x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

thePoints <- cbind(x,y)


as.integer(thePoints[,1] > thePoints[,2])
[1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

This puts the points on one side of the diagonal through the origin in one group, and the rest in the other. Keep in mind that if your line does not go through the origin, you will have to modify this check a bit.
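For a line that does not pass through the origin, the same trick still works; you only shift the comparison. A minimal sketch (the intercept `b` here is a hypothetical choice, not from the original question):

```r
x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

# Split along the line y = x + b instead of y = x:
b <- 1                        # hypothetical intercept
grp <- as.integer(y > x + b)  # 1 above the line, 0 on or below it
```

Note that with b = 1 the points (4, 5) and (8, 9) lie exactly on the line and therefore end up in group 0.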

  2. K-means with correlation distance:

The K-means function:

myKmeans <- function(x, centers, distFun, nIter=10) {
    clusterHistory <- vector(nIter, mode="list")
    centerHistory <- vector(nIter, mode="list")

    for(i in 1:nIter) {
        distsToCenters <- distFun(x, centers)           # n x k matrix of distances
        clusters <- apply(distsToCenters, 1, which.min) # assign each point to its nearest center
        centers <- apply(x, 2, tapply, clusters, mean)  # recompute centers as cluster means
        # Saving history
        clusterHistory[[i]] <- clusters
        centerHistory[[i]] <- centers
    }

    list(clusters=clusterHistory, centers=centerHistory)
}

And correlation distance:

myCor <- function(points1, points2) {
    # Pearson correlation between rows, rescaled from [-1, 1] to [1, 0]
    # so that perfectly correlated points have distance 0
    1 - ((cor(t(points1), t(points2)) + 1) / 2)
}

mat <- cbind(x, y)         # the data from the question
centers <- mat[c(1, 11), ] # one starting center from each visual group
theResult <- myKmeans(mat, centers, myCor, 10)


Here is how both solutions look:

plot(thePoints, col=as.integer(thePoints[,1] > thePoints[,2])+1, main="Using a line", xlab="x", ylab="y")
plot(thePoints, col=theResult$clusters[[10]], main="K-means with correlation distance", xlab="x", ylab="y")
points(theResult$centers[[10]], col=1:2, cex=3, pch=19)

(plot: line-based split vs. K-means with correlation distance)

So it is more about which distance measure you use than about some inherent deficiency of K-means.

You can also find better R implementations of K-means with correlation distance instead of using the one provided here.
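For example, the `Kmeans` function in the amap package accepts a `method` argument that includes correlation-based distances (an assumption worth verifying against the package documentation, and the package must be installed):

```r
library(amap)

x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
thePoints <- cbind(x, y)

# K-means with a correlation-based distance instead of Euclidean
km <- Kmeans(thePoints, centers = 2, method = "correlation")
plot(thePoints, col = km$cluster, pch = 19)
```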