I am trying to write my first own kmeans algorithm in R. I am new in this field, so please don't judge me for don't seeing the obvious.
In its current state, the algorithm takes two vectors x
, y
, calculates the distance of each data point to the cluster centers and assigns the cluster with minimal distance from its center to the data point. The algorithm stops when there is no change in the assignment and thus no change in the cluster centers.
# Sample data
set.seed(100)
xval <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
yval <- rnorm(12, mean = rep(c(1,2,1), each = 4), sd = 0.2)
# Kmeans function
kclus <- function(x, y, nclus) {
# start with random cluster centers
xcen <- runif(n = nclus, min = min(x), max = max(x))
ycen <- runif(n = nclus, min = min(y), max = max(y))
# data points and cluster assignment in "data"
# cluster coordinates in "clus"
data <- data.frame(xval = x, yval = y, clus = NA)
clus <- data.frame(name = 1:nclus, xcen = xcen, ycen = ycen)
finish <- FALSE
while(finish == FALSE) {
# assign cluster with minimum distance to each data point
for(i in 1:length(x)) {
dist <- sqrt((x[i]-clus$xcen)^2 + (y[i]-clus$ycen)^2)
data$clus[i] <- which.min(dist)
}
xcen_old <- clus$xcen
ycen_old <- clus$ycen
# calculate new cluster centers
for(i in 1:nclus) {
clus[i,2] <- mean(subset(data$xval, data$clus == i))
clus[i,3] <- mean(subset(data$yval, data$clus == i))
}
# stop the loop if there is no change in cluster coordinates
if(identical(xcen_old, clus$xcen) & identical(ycen_old, clus$ycen)) finish <- TRUE
}
data
}
# apply kmeans function to sample data
cluster <- kclus(xval, yval, 4)
# plot the result
ggplot(cluster, aes(xval, yval, color = as.factor(clus))) + geom_point()
This is working relatively good so far. But I have no clue, how I can force the algorithm to a specific number of clusters. It is already implemented as the parameter nclus
in my kclus()
function, but I don't know how to make use of it.
For the given sample data, the algorithm just gives me three clusters. I want to force him to give me four clusters back.
Anybody here who can give me an advice on that?
Thank you so much, Marcus