5
votes

I am wondering what other people are doing with K-means cluster ordering. I am making heatmaps (mainly of ChIP-Seq data) and getting nice-looking figures with a custom heatmap function (based on R's built-in heatmap function). However, I'd like two improvements. The first is to order my clusters by decreasing average value. For instance, the following code:

fit = kmeans(data, 8, iter.max=50, nstart=10)
d = data.frame(data, symbol)
d = data.frame(d, fit$cluster)
d = d[order(d$fit.cluster),]

gives me a data.frame ordered by the cluster column. What is the best way to order the rows such that the 8 clusters are in order of their respective means?

Second, do you recommend sorting the rows WITHIN each cluster from highest mean value to lowest? This will impose a more organized look onto the data, but may fool a non-cautious observer into inferring something that he perhaps should not. If you do recommend this, how would you do it most efficiently?

Means of what? One of the variables used for clustering or something else? - Gavin Simpson
Means of the values in each cluster. For instance, if each cluster is 30 rows in a data.frame and the data.frame has 10 columns upon which k-means clustering is performed, I'd want the mean of the 300 values in each cluster. Could also use the centroid. - Ron Gejman
The centroid isn't a single number for each cluster; it is a point in 10-d space, and hence each cluster centroid has 10 coordinates. - Gavin Simpson

1 Answer

4
votes

Not an exact answer to what you ask, but perhaps you might consider seriation instead of k-means clustering. It is a bit like ordination rather than clustering, but one end result is a heatmap of the seriated data which sounds similar to what you seem to be doing with k-means followed by a specifically ordered heatmap.

There is an R package for seriation, also called seriation, and it has a vignette that you can get directly from CRAN.
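To give a rough idea of what that looks like, here is a minimal sketch. It assumes the seriation package is installed; the matrix `m` is made-up data, and I leave `seriate()` on its default method for distance objects, which may differ between package versions, so the exact ordering you get could vary.

```r
## Sketch only: assumes install.packages("seriation") has been run.
## `m` is made-up data: two groups of 10 rows on 3 variables.
library(seriation)

set.seed(1)
m <- rbind(matrix(rnorm(30,  2), ncol = 3),
           matrix(rnorm(30, -2), ncol = 3))
m <- m[sample(nrow(m)), ]      # shuffle the rows

o   <- seriate(dist(m))        # seriate rows from their Euclidean distances
ord <- get_order(o)            # permutation of the row indices

## heatmap of the rows in seriated order
image(1:3, 1:nrow(m), t(m[ord, ]), xlab = "Variables",
      ylab = "Observations", xaxt = "n", main = "Seriated")
axis(1, at = 1:3)
```

Similar rows should end up next to each other without you having to choose a number of clusters first.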

I'll answer the specifics of the Q once I've cooked up an example to try.

Ok - proper answer following on from your comment above. First some dummy data - 3 clusters of 10 samples each, on each of 3 variables.

set.seed(1)
dat <- data.frame(A = c(rnorm(10, 2), rnorm(10, -2), rnorm(10, -2)),
                  B = c(rnorm(10, 0), rnorm(10, 5), rnorm(10, -2)),
                  C = c(rnorm(10, 0), rnorm(10, 0), rnorm(10, -10)))

## randomise the rows
dat <- dat[sample(nrow(dat)),]
clus <- kmeans(scale(dat, scale = FALSE), centers = 3, iter.max = 50,
               nstart = 10)

## mean of all values (every row and column) in each cluster
mns <- sapply(split(dat, clus$cluster), function(x) mean(unlist(x)))

## order the data by cluster with clusters ordered by `mns`, low to high
dat2 <- do.call("rbind", split(dat, clus$cluster)[order(mns)])

## heatmaps
## original first, then reordered:
layout(matrix(1:2, ncol = 2))
image(1:3, 1:30, t(data.matrix(dat)), ylab = "Observations", 
      xlab = "Variables", xaxt = "n", main = "Original")
axis(1, at = 1:3)
image(1:3, 1:30, t(data.matrix(dat2)), ylab = "Observations", 
      xlab = "Variables", xaxt = "n", main = "Reordered")
axis(1, at = 1:3)
layout(1)

Yielding:

[Figure: original and reordered heatmaps]
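On the second part of the question (sorting rows within each cluster), one possible approach is a single call to `order()` with several keys. This is only a sketch: it rebuilds the same dummy data as above so that it runs on its own, and the names `clus_mean`, `row_mean`, `ord` and `dat3` are mine. It also sorts clusters from highest mean to lowest, as the question asks, rather than low to high.

```r
## same dummy data and clustering as above
set.seed(1)
dat <- data.frame(A = c(rnorm(10, 2), rnorm(10, -2), rnorm(10, -2)),
                  B = c(rnorm(10, 0), rnorm(10, 5), rnorm(10, -2)),
                  C = c(rnorm(10, 0), rnorm(10, 0), rnorm(10, -10)))
dat <- dat[sample(nrow(dat)), ]
clus <- kmeans(scale(dat, scale = FALSE), centers = 3, iter.max = 50,
               nstart = 10)

## mean of all values in each cluster, named by cluster label
mns <- sapply(split(dat, clus$cluster), function(x) mean(unlist(x)))

## per-row sort keys
clus_mean <- mns[as.character(clus$cluster)]  # each row's cluster mean
row_mean  <- rowMeans(dat)                    # each row's own mean

## clusters by decreasing cluster mean; the cluster label keeps clusters
## together if two means tie; then rows by decreasing row mean
ord  <- order(-clus_mean, clus$cluster, -row_mean)
dat3 <- dat[ord, ]
```

Note that `image()` draws the first row at the bottom of the plot, so pass `dat3[nrow(dat3):1, ]` if you want the highest-mean cluster at the top. As for whether to do it: the within-cluster sort is purely cosmetic, since k-means says nothing about the order of rows inside a cluster, so I would mention in the figure legend that rows were sorted by their mean.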