1
votes

I am trying to write my first own kmeans algorithm in R. I am new in this field, so please don't judge me for don't seeing the obvious.

In its current state, the algorithm takes two vectors x, y, calculates the distance of each data point to the cluster centers and assigns the cluster with minimal distance from its center to the data point. The algorithm stops when there is no change in the assignment and thus no change in the cluster centers.

# Sample data    
set.seed(100)
xval <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
yval <- rnorm(12, mean = rep(c(1,2,1), each = 4), sd = 0.2)

# Kmeans function
kclus <- function(x, y, nclus) {

    # start with random cluster centers
    xcen <- runif(n = nclus, min = min(x), max = max(x))   
    ycen <- runif(n = nclus, min = min(y), max = max(y))

    # data points and cluster assignment in "data"
    # cluster coordinates in "clus"
    data <- data.frame(xval = x, yval = y, clus = NA)
    clus <- data.frame(name = 1:nclus, xcen = xcen, ycen = ycen)

    finish <- FALSE

    while(finish == FALSE) {

        # assign cluster with minimum distance to each data point
        for(i in 1:length(x)) {
            dist <- sqrt((x[i]-clus$xcen)^2 + (y[i]-clus$ycen)^2)
            data$clus[i] <- which.min(dist)
        }

        xcen_old <- clus$xcen
        ycen_old <- clus$ycen

        # calculate new cluster centers
        for(i in 1:nclus) {
            clus[i,2] <- mean(subset(data$xval, data$clus == i))
            clus[i,3] <- mean(subset(data$yval, data$clus == i))
        }

        # stop the loop if there is no change in cluster coordinates
        if(identical(xcen_old, clus$xcen) & identical(ycen_old, clus$ycen)) finish <- TRUE
    }
    data
}

# apply kmeans function to sample data
cluster <- kclus(xval, yval, 4)

# plot the result
ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + geom_point()

This is working relatively good so far. But I have no clue, how I can force the algorithm to a specific number of clusters. It is already implemented as the parameter nclus in my kclus() function, but I don't know how to make use of it.

For the given sample data, the algorithm just gives me three clusters. I want to force him to give me four clusters back.

Anybody here who can give me an advice on that?

Thank you so much, Marcus

3

3 Answers

3
votes

This is not true that the algorithm you implemented always gives you 3 clusters, probably you have not run it for sufficiently large number of times. Here is slight modification of your code where we shall be able to see that the number of clusters output depends on the initialization of the cluster centroids (that are randomly chosen and can be controlled with random.seed):

# Sample data    
set.seed(100)
xval <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
yval <- rnorm(12, mean = rep(c(1,2,1), each = 4), sd = 0.2)

# Kmeans function with random.seed for initialization
kclus <- function(x, y, nclus, random.seed=123) {

  set.seed(random.seed)
  # start with random cluster centers
  xcen <- runif(n = nclus, min = min(x), max = max(x))   
  ycen <- runif(n = nclus, min = min(y), max = max(y))

  # data points and cluster assignment in "data"
  # cluster coordinates in "clus"
  data <- data.frame(xval = x, yval = y, clus = NA)
  clus <- data.frame(name = 1:nclus, xcen = xcen, ycen = ycen)

  finish <- FALSE

  while(finish == FALSE) {

    # assign cluster with minimum distance to each data point
    for(i in 1:length(x)) {
      dist <- sqrt((x[i]-clus$xcen)^2 + (y[i]-clus$ycen)^2)
      data$clus[i] <- which.min(dist)
    }

    xcen_old <- clus$xcen
    ycen_old <- clus$ycen

    # calculate new cluster centers
    for(i in 1:nclus) {
      clus[i,2] <- mean(subset(data$xval, data$clus == i))
      clus[i,3] <- mean(subset(data$yval, data$clus == i))
    }

    # stop the loop if there is no change in cluster coordinates
    if(identical(xcen_old, clus$xcen) & identical(ycen_old, clus$ycen)) finish <- TRUE
  }
  data
}

# with default random seed 123, you should be able to reproduce the result
# as you can see, in this case, no data points were assigned to the 4th cluster
cluster <- kclus(xval, yval, 4)
cluster.centers <- aggregate(.~clus, cluster, mean)
ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + 
  geom_point(size=5) + 
  geom_point(data=cluster.centers, aes(xval, yval, col=as.factor(clus)), pch=8, size=5)

enter image description here

# run with a different random seed = 12
# as you can see, in this case, the algorithm outputs 4 clusters, with the 2nd cluster having a single datapoint assigned to
    cluster <- kclus(xval, yval, 4, 12)
    cluster.centers <- aggregate(.~clus, cluster, mean)
    ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + 
      geom_point(size=5) + 
      geom_point(data=cluster.centers, aes(xval, yval, col=as.factor(clus)), pch=8, size=5)

enter image description here

# run with a different random seed = 12345
# as you can see, in this case, the algorithm outputs 2 clusters, with the all the datapoints assigned to the 1st and the 2nd cluster
    cluster <- kclus(xval, yval, 4, 12345)
    cluster.centers <- aggregate(.~clus, cluster, mean)
    ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + 
      geom_point(size=5) + 
      geom_point(data=cluster.centers, aes(xval, yval, col=as.factor(clus)), pch=8, size=5)

enter image description here

As we can see from above examples, whether or not a clusters ends up with no points assigned to it at convergence depends upon the initial center positions and also the data distribution. In general, if kmeans ends up with one cluster centroid as empty, it means that if you try to forcefully assign one point to the empty cluster, it will likely result in poorer quality clusters, something that you don't want to do.

There are several things that you can try at this point.

  1. First is you can run your algorithm multiple times, each time with different randomly initialized centers and then choose the result with the highest cluster qualities (measured by SSE etc.).
  2. Second thing that you can try is smarter initialization with Kmeans++.
  3. A not-so-good-choice could be to modify your algorithm to ensure while reassignment of clusters it guarantees that each of the k (=4) clusters has at least one point asigned to it (if not then don't reassign).
  4. Finally you could try some other algorithm such as hierarchical clustering that gives you more flexibility via dendograms to choose as many clusters as you want.
2
votes

That is just the way that k-means works. You have two main choices. Either live with getting fewer clusters or whenever the number of clusters falls below the requested number of clusters, start a new one. To start a new one, one might find the point that is farthest from its cluster center and change it to be a new cluster. However, there are problems with this. Suppose that you have 20 points and the user asks for 25 clusters. You just can't satisfy some people.

0
votes

The problem is your initialization.

Initializing with random numbers is the worst possible choice unless your data is uniformly random distributed (and then you don't have clusters).

Now if you generate a center in the top left corner, it may have 0 points, and your code likely then generates a NaN mean next.

Instead, try choosing k points from your data as centers. This is much less likely to go bad (although it can).