
I would like to find an algorithm that circumvents some drawbacks of k-means:

Given:

x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

matrix <- cbind(x, y)                 # defining the data matrix
Kmeans <- kmeans(matrix, centers=2)   # with 2 centroids

plot(x, y, col=Kmeans$cluster, pch=19, cex=2)
points(Kmeans$centers, col=1:2, pch=3, cex=3, lwd=3)

Here I would like an algorithm that clusters the data into two groups divided by a diagonal running from the lower-left corner to the upper-right corner.

This wouldn't minimize the within-group inertia, would it? So I think you're looking for something that is not clustering. Have you thought about fitting two lines? Or fitting a Gaussian mixture? The real question is: why do you think your groups should look like this? – iago-lito
Hmm, it seems like a natural choice. I mean, obviously there is some linearity. Still, I hoped to find a clustering algorithm that would approximate my intuition about the two clusters. – Googme
I think a mixture of Gaussians might work well in this case. The algorithm will automatically choose how many clusters you need (here 2) and their correlation structure (here 45°-stretched). I don't know yet which package you'd need, but it shouldn't be too hard to find. – iago-lito

2 Answers

1 vote

Try Mclust from the mclust package; it will fit a Gaussian mixture to your data. The default behaviour:

library(mclust)

mc <- Mclust(matrix)
points(t(mc$parameters$mean))
plot(mc)

... will find 4 groups, but you might be able to force it to 2, or to force the correlation structure of the Gaussians to be stretched in the right direction.

Be aware that it will be hard to interpret and justify the meaning of your groups unless you understand very well why you want exactly two of them.
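If you do want exactly two groups, Mclust takes a `G` argument, and `modelNames` restricts the covariance structure of the components. A minimal sketch of that idea, assuming the mclust package is installed (the data is the matrix from the question):

```r
library(mclust)

x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
matrix <- cbind(x, y)

# G = 2 forces two components; "VVV" lets each component have its own
# freely oriented (e.g. 45-degree-stretched) covariance matrix.
mc2 <- Mclust(matrix, G = 2, modelNames = "VVV")

plot(x, y, col = mc2$classification, pch = 19, cex = 2)
points(t(mc2$parameters$mean), col = 1:2, pch = 3, cex = 3, lwd = 3)
```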

0 votes

What you are asking for can be solved in multiple ways. Here are two:

  1. The first way is to simply define the separating line of your clusters. Since you know how your points should be grouped (by a line), you can use that directly.

If you want your line to start at the origin, then simply check if x > y:

x<- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y<- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

thePoints <- cbind(x,y)


as.integer(thePoints[,1] > thePoints[,2])
[1] 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0

This puts the points on one side of the diagonal through the origin in one group, and the rest in the other. Keep in mind that if your line does not go through the origin, you will have to modify this check a bit.
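For a line that does not pass through the origin, the same trick still works; you only shift the comparison. A minimal sketch (the intercept `b` here is a hypothetical choice, not from the original question):

```r
x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)

# Split along the line y = x + b instead of y = x:
b <- 1                        # hypothetical intercept
grp <- as.integer(y > x + b)  # 1 above the line, 0 on or below it
```

Note that with b = 1 the points (4, 5) and (8, 9) lie exactly on the line and therefore end up in group 0.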

  2. K-means with correlation distance:

The K-means function:

myKmeans <- function(x, centers, distFun, nIter=10) {
    clusterHistory <- vector(nIter, mode="list")
    centerHistory <- vector(nIter, mode="list")

    for(i in 1:nIter) {
        distsToCenters <- distFun(x, centers)           # n x k matrix of distances
        clusters <- apply(distsToCenters, 1, which.min) # assign each point to its nearest center
        centers <- apply(x, 2, tapply, clusters, mean)  # recompute centers as cluster means
        # Saving history
        clusterHistory[[i]] <- clusters
        centerHistory[[i]] <- centers
    }

    list(clusters=clusterHistory, centers=centerHistory)
}

And correlation distance:

myCor <- function(points1, points2) {
    # Pearson correlation between rows, rescaled from [-1, 1] to [1, 0]
    # so that perfectly correlated points have distance 0
    1 - ((cor(t(points1), t(points2)) + 1) / 2)
}

mat <- cbind(x, y)         # the data from the question
centers <- mat[c(1, 11), ] # one starting center from each visual group
theResult <- myKmeans(mat, centers, myCor, 10)


Here is how both solutions look:

plot(thePoints, col=as.integer(thePoints[,1] > thePoints[,2])+1, main="Using a line", xlab="x", ylab="y")
plot(thePoints, col=theResult$clusters[[10]], main="K-means with correlation distance", xlab="x", ylab="y")
points(theResult$centers[[10]], col=1:2, cex=3, pch=19)

(plot: line-based split vs. K-means with correlation distance)

So it is more about which distance measure you use than about some inherent deficiency of K-means.

You can also find better R implementations of K-means with correlation distance instead of using the one provided here.
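For example, the `Kmeans` function in the amap package accepts a `method` argument that includes correlation-based distances (an assumption worth verifying against the package documentation, and the package must be installed):

```r
library(amap)

x <- c(4,4,5,5,6,7,8,9,9,10,2,3,3,4,5,6,6,7,8,8)
y <- c(2,3,3,4,4,5,5,7,6,8,4,5,6,5,7,8,9,9,9,10)
thePoints <- cbind(x, y)

# K-means with a correlation-based distance instead of Euclidean
km <- Kmeans(thePoints, centers = 2, method = "correlation")
plot(thePoints, col = km$cluster, pch = 19)
```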