
I have a set of data (5000 points with 4 dimensions) that I have clustered using kmeans in R.

I want to order the points in each cluster by their distance to the center of that cluster.

Very simply, the data looks like this (I am using a subset to test out various approaches):

id  Ans Acc Que Kudos
1   100 100 100 100
2   85  83  80  75
3   69  65  30  29
4   41  45  30  22 
5   10  12  18  16
6   10  13  10  9
7   10  16  16  19
8   65  68  100 100
9   36  30  35  29
10  36  30  26  22

Firstly, I used the following method to cluster the dataset into 2 clusters:

(result <- kmeans(data, 2))

This returns a kmeans object that has components such as cluster, centers, etc.

But I cannot figure out how to compare each point and produce an ordered list.

Secondly, I tried the seriation approach as suggested by another SO user here

I use these commands:

clus <- kmeans(scale(x, scale = FALSE), centers = 3, iter.max = 50, nstart = 10)
mns <- sapply(split(x, clus$cluster), function(x) mean(unlist(x)))
result <- dat[order(order(mns)[clus$cluster]), ]

This seems to produce an ordered list, but if I bind it to the cluster labels (using the following cbind command):

result <- cbind(x[order(order(mns)[clus$cluster]), ],clus$cluster)

I get the following result, which does not appear to be ordered correctly:

    id  Ans Acc Que Kudos clus
1   3   69  65  30  29    1
2   4   41  45  30  22    1
3   5   10  12  18  16    2
4   6   10  13  10  9     2
5   7   10  16  16  19    2
6   9   36  30  35  29    2
7   10  36  30  26  22    2
8   1   100 100 100 100   1
9   2   85  83  80  75    2
10  8   65  68  100 100   2

I don't want to write commands willy-nilly; I want to understand how the approach works. If anyone could help out or shed some light on this, it would be really great.

EDIT:

As the clusters can be easily plotted, I'd imagine there is a more straightforward way to get and rank the distances between points and the center.

The centers for the above clusters (when using k = 2) are as follows. But I do not know how to get and compare this with each individual point.

     Ans    Accep     Que      Kudos
1 83.33333 83.66667 93.33333 91.66667
2 30.28571 30.14286 23.57143 20.85714 

NB:

I don't need to use kmeans, but I want to specify the number of clusters and retrieve an ordered list of points from those clusters.

This is a good question... check that you are not using the ID to cluster (I guess there are cases where you may want to, but it is unlikely). - Seth
Cool, I don't want to cluster the ids, oversight on my behalf. I will amend the question. Thanks. - slotishtype
I believe that kmeans() also returns the final cluster centers. From there, it shouldn't be too hard to compute the distance from each point to the center of its cluster. - user554546
Hi @Jack Maney, you are right, it returns the cluster means, for example: Cluster means (X.Ans, X.Accep, X.Ques, X.Kudos): cluster 1 = 83.33333, 83.66667, 93.33333, 91.66667; cluster 2 = 30.28571, 30.14286, 23.57143, 20.85714. But I don't know how to access the clustered data points to compare distances. I have a feeling that this is quite straightforward but I am unsure of how to proceed. - slotishtype
You have the data points (i.e. the same data that you fed into kmeans()). You have the cluster assignments and centers of each cluster. What, exactly, is confusing you about computing distances between each point and the center of that point's cluster? - user554546

1 Answer


Here is an example that does what you ask, using the first example from ?kmeans. It is probably not terribly efficient, but is something to build upon.

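# Taken straight from ?kmeans
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
cl <- kmeans(x, 2)

# Append the cluster label as a third column
x <- cbind(x, cl = cl$cluster)

# Function to apply to each cluster to
# do the ordering
orderCluster <- function(i, data, centers) {
    # Extract cluster and center (drop = FALSE keeps a one-row cluster as a matrix)
    dt <- data[data[, 3] == i, , drop = FALSE]
    ct <- centers[i, ]

    # Calculate squared distances (same ordering as the distances themselves).
    # sweep() subtracts the center column by column; a plain dt[, 1:2] - ct
    # would recycle ct down the rows and give wrong values.
    dt <- cbind(dt, dist = rowSums(sweep(dt[, 1:2, drop = FALSE], 2, ct)^2))
    # Sort
    dt[order(dt[, 4]), , drop = FALSE]
}

do.call(rbind, lapply(sort(unique(cl$cluster)), orderCluster, data = x, centers = cl$centers))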
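
The same idea can also be written without a per-cluster function. What follows is only a sketch, applied to the ten sample rows from the question; the names dat, vars, km, assigned and ordered are made up for illustration, and the seed is arbitrary. It relies on indexing km$centers by km$cluster, which gives one center row per data point, so the distances come out of a single matrix subtraction. The id column is kept out of the clustering.

```r
# The ten sample rows from the question (id is kept aside, not clustered on)
dat <- data.frame(
  id    = 1:10,
  Ans   = c(100, 85, 69, 41, 10, 10, 10, 65, 36, 36),
  Acc   = c(100, 83, 65, 45, 12, 13, 16, 68, 30, 30),
  Que   = c(100, 80, 30, 30, 18, 10, 16, 100, 35, 26),
  Kudos = c(100, 75, 29, 22, 16, 9, 19, 100, 29, 22)
)
vars <- c("Ans", "Acc", "Que", "Kudos")

set.seed(1)  # arbitrary seed, only so the run is repeatable
km <- kmeans(dat[, vars], centers = 2, nstart = 10)

# Row i of `assigned` is the center of the cluster that point i belongs to
assigned <- km$centers[km$cluster, , drop = FALSE]

# Euclidean distance from each point to its own cluster center
dat$cluster <- km$cluster
dat$dist <- sqrt(rowSums((as.matrix(dat[, vars]) - assigned)^2))

# Sort by cluster first, then by distance within each cluster
ordered <- dat[order(dat$cluster, dat$dist), ]
ordered
```

Because both matrices have the same shape, the subtraction is element by element and no recycling is involved. As a sanity check, sum(dat$dist^2) should match km$tot.withinss.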