Writing own kmeans algorithm in R

Question

I am trying to write my first own kmeans algorithm in R. I am new in this field, so please don't judge me for don't seeing the obvious.

In its current state, the algorithm takes two vectors x, y, calculates the distance of each data point to the cluster centers and assigns the cluster with minimal distance from its center to the data point. The algorithm stops when there is no change in the assignment and thus no change in the cluster centers.

# Sample data    
set.seed(100)
xval <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
yval <- rnorm(12, mean = rep(c(1,2,1), each = 4), sd = 0.2)

# Kmeans function
kclus <- function(x, y, nclus) {

    # start with random cluster centers
    xcen <- runif(n = nclus, min = min(x), max = max(x))   
    ycen <- runif(n = nclus, min = min(y), max = max(y))

    # data points and cluster assignment in "data"
    # cluster coordinates in "clus"
    data <- data.frame(xval = x, yval = y, clus = NA)
    clus <- data.frame(name = 1:nclus, xcen = xcen, ycen = ycen)

    finish <- FALSE

    while(finish == FALSE) {

        # assign cluster with minimum distance to each data point
        for(i in 1:length(x)) {
            dist <- sqrt((x[i]-clus$xcen)^2 + (y[i]-clus$ycen)^2)
            data$clus[i] <- which.min(dist)
        }

        xcen_old <- clus$xcen
        ycen_old <- clus$ycen

        # calculate new cluster centers
        for(i in 1:nclus) {
            clus[i,2] <- mean(subset(data$xval, data$clus == i))
            clus[i,3] <- mean(subset(data$yval, data$clus == i))
        }

        # stop the loop if there is no change in cluster coordinates
        if(identical(xcen_old, clus$xcen) & identical(ycen_old, clus$ycen)) finish <- TRUE
    }
    data
}

# apply kmeans function to sample data
cluster <- kclus(xval, yval, 4)

# plot the result
ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + geom_point()

This is working relatively good so far. But I have no clue, how I can force the algorithm to a specific number of clusters. It is already implemented as the parameter nclus in my kclus() function, but I don't know how to make use of it.

For the given sample data, the algorithm just gives me three clusters. I want to force him to give me four clusters back.

Anybody here who can give me an advice on that?

Thank you so much, Marcus

Sandipan Dey Sandipan Dey · Accepted Answer · 2017-01-28T18:17:35

This is not true that the algorithm you implemented always gives you 3 clusters, probably you have not run it for sufficiently large number of times. Here is slight modification of your code where we shall be able to see that the number of clusters output depends on the initialization of the cluster centroids (that are randomly chosen and can be controlled with random.seed):

# Sample data    
set.seed(100)
xval <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
yval <- rnorm(12, mean = rep(c(1,2,1), each = 4), sd = 0.2)

# Kmeans function with random.seed for initialization
kclus <- function(x, y, nclus, random.seed=123) {

  set.seed(random.seed)
  # start with random cluster centers
  xcen <- runif(n = nclus, min = min(x), max = max(x))   
  ycen <- runif(n = nclus, min = min(y), max = max(y))

  # data points and cluster assignment in "data"
  # cluster coordinates in "clus"
  data <- data.frame(xval = x, yval = y, clus = NA)
  clus <- data.frame(name = 1:nclus, xcen = xcen, ycen = ycen)

  finish <- FALSE

  while(finish == FALSE) {

    # assign cluster with minimum distance to each data point
    for(i in 1:length(x)) {
      dist <- sqrt((x[i]-clus$xcen)^2 + (y[i]-clus$ycen)^2)
      data$clus[i] <- which.min(dist)
    }

    xcen_old <- clus$xcen
    ycen_old <- clus$ycen

    # calculate new cluster centers
    for(i in 1:nclus) {
      clus[i,2] <- mean(subset(data$xval, data$clus == i))
      clus[i,3] <- mean(subset(data$yval, data$clus == i))
    }

    # stop the loop if there is no change in cluster coordinates
    if(identical(xcen_old, clus$xcen) & identical(ycen_old, clus$ycen)) finish <- TRUE
  }
  data
}

# with default random seed 123, you should be able to reproduce the result
# as you can see, in this case, no data points were assigned to the 4th cluster
cluster <- kclus(xval, yval, 4)
cluster.centers <- aggregate(.~clus, cluster, mean)
ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + 
  geom_point(size=5) + 
  geom_point(data=cluster.centers, aes(xval, yval, col=as.factor(clus)), pch=8, size=5)

# run with a different random seed = 12
# as you can see, in this case, the algorithm outputs 4 clusters, with the 2nd cluster having a single datapoint assigned to
    cluster <- kclus(xval, yval, 4, 12)
    cluster.centers <- aggregate(.~clus, cluster, mean)
    ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + 
      geom_point(size=5) + 
      geom_point(data=cluster.centers, aes(xval, yval, col=as.factor(clus)), pch=8, size=5)

# run with a different random seed = 12345
# as you can see, in this case, the algorithm outputs 2 clusters, with the all the datapoints assigned to the 1st and the 2nd cluster
    cluster <- kclus(xval, yval, 4, 12345)
    cluster.centers <- aggregate(.~clus, cluster, mean)
    ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + 
      geom_point(size=5) + 
      geom_point(data=cluster.centers, aes(xval, yval, col=as.factor(clus)), pch=8, size=5)

As we can see from above examples, whether or not a clusters ends up with no points assigned to it at convergence depends upon the initial center positions and also the data distribution. In general, if kmeans ends up with one cluster centroid as empty, it means that if you try to forcefully assign one point to the empty cluster, it will likely result in poorer quality clusters, something that you don't want to do.

There are several things that you can try at this point.

First is you can run your algorithm multiple times, each time with different randomly initialized centers and then choose the result with the highest cluster qualities (measured by SSE etc.).
Second thing that you can try is smarter initialization with Kmeans++.
A not-so-good-choice could be to modify your algorithm to ensure while reassignment of clusters it guarantees that each of the k (=4) clusters has at least one point asigned to it (if not then don't reassign).
Finally you could try some other algorithm such as hierarchical clustering that gives you more flexibility via dendograms to choose as many clusters as you want.

Writing own kmeans algorithm in R

3 Answers