I'm trying to write a function that loops through a list in order to run kmeans clustering on only specific columns of a dataset. I want the output to be a matrix/dataframe of the cluster membership of each observation when kmeans is run on each set of columns.

Here's a mock dataset and the function I came up with (I'm new to R, so sorry if it's shaky):

set.seed(123)
mydata <- data.frame(a = rnorm(100,0,1), b = rnorm(100,0,1), c = rnorm(100,0,1),
                     d = rnorm(100,0,1), e = rnorm(100,0,1))

set.seed(123)
my.kmeans <- function(data,k,...) {
    # set up dataframe for clusters
    clusters <- data.frame(matrix(nrow = nrow(data), ncol = length(list(...))))
    for(i in list(...)) {
        kmeans <- kmeans(data[,i],centers = k)
        clusters[,i] <- kmeans$cluster
    }
    colnames(clusters) <- list(...)
    clusters
}

My question is: this seems to work when I ask it to use only consecutive columns, but not when the column sets skip over some columns. For instance, the first of the following calls works, but the second does not. Any idea how I can fix this?

# works how I want 
head(my.kmeans(data = mydata, k = 8, c(1,2), c(2,3), c(1,2,3)))

# doesn't work 
head(my.kmeans(data = mydata, k = 8, c(1,2), c(2,3), c(1,2,5)))

Also, I know people recommend using apply functions and staying away from for loops, but I don't know how to do this with an apply function. Any advice on that would be much appreciated as well.

Thanks so much!

Danny

the problem is in this part of the code clusters[,i] <- kmeans$cluster because i resolves to 5 in your second case - SatZ
Thanks so much @SatZ! Could you explain why i resolves to 5? And how I might get around this? Sorry--I'm pretty new to R. Thanks a lot! - Danny
For anyone who's following (though this is pretty specific so I doubt it), I think I figured it out: you have to change "for(i in list(...))" to "for(i in 1:length(list(...)))" and pick out the columns for each run with list(...)[[i]]; that way, when you subset clusters with i later, it fills in correctly. Thanks @SatZ - Danny
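
To make the point in these comments concrete, here is a minimal sketch of what the loop variable holds in each style of loop. The names col_sets and out are made up for illustration and are not from the question. It also suggests why the "working" call may not be doing quite what was intended: when i is the whole vector c(1,2,3), the assignment writes the same values into every column it names, so later runs overwrite earlier ones.

```r
# Hypothetical column sets, mirroring the second call in the question
col_sets <- list(c(1, 2), c(2, 3), c(1, 2, 5))

# Looping over the list itself: i is each vector of column numbers
for (i in col_sets) print(i)

# Looping over positions: i is 1, 2, 3 and col_sets[[i]] is the vector
for (i in seq_along(col_sets)) print(col_sets[[i]])

# Using a vector as the column index writes to all of those columns at once
out <- data.frame(matrix(nrow = 4, ncol = 3))
out[, c(1, 2)] <- c(1, 1, 2, 2)     # fills columns 1 and 2
out[, c(1, 2, 3)] <- c(2, 2, 1, 1)  # overwrites columns 1 to 3
out
```

In the failing call the index vector contains 5, which lies beyond the three columns of clusters, so that assignment cannot be filled in the same way.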

1 Answer

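Building on @SatZ's comments, loop over the positions of the list rather than over its elements, so that i picks the output column while list[[i]] picks the data columns: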

set.seed(123)
mydata <- data.frame(a = rnorm(100,0,1), b = rnorm(100,0,1), c = rnorm(100,0,1),
                     d = rnorm(100,0,1), e = rnorm(100,0,1))
mylist <- list(c(1,2), c(2,3), c(1,2,5))

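my.kmeans <- function(data,k,list) {
  # set up dataframe for clusters: one column per set of columns
  clusters <- data.frame(matrix(nrow = nrow(data), ncol = length(list)))
  for(i in seq_along(list)) {
      # list[[i]] selects the data columns; i selects the output column
      fit <- kmeans(data[,list[[i]], drop = FALSE],centers = k)
      clusters[,i] <- fit$cluster
  }
  # label each output column with its column set, e.g. "c(1, 2)"
  colnames(clusters) <- as.character(list)
  clusters
}

set.seed(123) # seed right before the call so the clustering is reproducible
head(my.kmeans(data = mydata, k = 8, list = mylist))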
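
On the apply part of the question, here is one possible sketch using sapply instead of the for loop. It is not from the original posts: the function name kmeans_by_columns, the argument name col_sets, and the "a+b" style column labels are assumptions made for illustration. sapply runs kmeans once per column set and binds the membership vectors into a matrix with one column per set.

```r
set.seed(123)
mydata <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100),
                     d = rnorm(100), e = rnorm(100))
col_sets <- list(c(1, 2), c(2, 3), c(1, 2, 5))

# Hypothetical helper: cluster membership for each set of columns
kmeans_by_columns <- function(data, k, col_sets) {
  # each call returns a membership vector of length nrow(data),
  # so sapply collects them as the columns of a matrix
  memberships <- sapply(col_sets, function(cols) {
    kmeans(data[, cols, drop = FALSE], centers = k)$cluster
  })
  # label each column by the variables it was clustered on
  colnames(memberships) <- sapply(col_sets, function(cols) {
    paste(names(data)[cols], collapse = "+")
  })
  as.data.frame(memberships)
}

set.seed(123)
result <- kmeans_by_columns(mydata, k = 8, col_sets = col_sets)
head(result)
```

Because nothing is written by index into a pre-allocated data frame here, the mix-up between "position in the list" and "columns to use" cannot occur.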