0
votes

Search results in numerous places report that the argument nstart in R's function kmeans sets a number of iterations of the algorithm and chooses 'the best one', see e.g. https://datascience.stackexchange.com/questions/11485/k-means-in-r-usage-of-nstart-parameter. Can anyone provide any clarity on how it does this, i.e. by what measure does it define best?

Secondly: R's kmeans function takes an argument centers. Here, as typical in k-means, it is possible to initialise the centroids before the algorithm begins expectation-maximisation, by choosing as initial centroids rows (data-points) from within your data-set. (You could supply, in vector form, points not present in your data-set as well, with considerably greater effort. In this case you could in theory choose the global optimum as your centroids. This is not what I'm asking for.) When nstart or the seed randomises initializations, I am quite sure that it does so by picking a random choice of centroids from your data-set and starting from those (not just a random set of points within the space).

In general, therefore, I'm looking for a way to get a good (e.g. best out of $n$ trials, or best from nstart) set of starting data-instances from the data-set as initial centroids. Is there any way of extracting the 'winning' (=best) set of initial centroids from nstart (which I could then use, say, in the centers parameter in future)? Any other streamlined & quick way to get a very good set of starting centroids (presumably, reasonably close to where the cluster centres will end up being)?

Is there perhaps, at least, a way to extract from a given kmeans run, what initial centroids it chose to start with?

1

1 Answers

0
votes

The criterion that kmeans tries to minimize is the trace of the within scatter matrix, i.e. (unfortunately, this forum does not support LaTeX, but you hopefully can read it nevertheless):

$$ trace(S_w) = \sum_{k=1}^K \sum{x \in C_k} ||x - \mu_k||^2 $$

Concerning the best starting point: obviously, the "best" starting point would be the cluster centers eventually chosen by kmeans. These are returned in the attribute centers:

km <- kmeans(iris[,-5], 3)
print(km$centers)

If you are looking for the best random start point, you can create random start points yourself (with runif), do this nstart times and evaluate which initial configuration leads to the smallest km$tot.withinss:

nstart <- 10
K <- 3 # number of clusters
D <- 4 # data point dimension

# select possible range
r.min <- apply(iris[,-5], MARGIN=2, FUN=min)
r.max <- apply(iris[,-5], MARGIN=2, FUN=max)

for (i in 1:nstart) {
  centers <- data.frame(runif(K, r.min[d], r.max[d]))
  for (d in 2:D) {
    centers <- cbind(centers, data.frame(runif(K, r.min[d], r.max[d])))
  }
  names(centers) <- names(iris[,-5])

  # call kmeans with centers and compare tot.withinss
  # ...
}