Error looping through list: "Error in `[<-.data.frame`(`*tmp*`, , i, value = c(7L, 1L, 4L, 7L, 7L, : new columns would leave holes... "

Question

I'm trying to write a function that loops through a list in order to run kmeans clustering on only specific columns of a dataset. I want the output to be a matrix/dataframe of the cluster membership of each observation when kmeans is run on each set of columns.

Here's a mock dataset and the function I came up with (I'm new to R--sorry if it's shaky)

set.seed(123)
mydata <- data.frame(a = rnorm(100,0,1), b = rnorm(100,0,1), c = rnorm(100,0,1),
                     d = rnorm(100,0,1), e = rnorm(100,0,1))

set.seed(123)
my.kmeans <- function(data,k,...) {
    clusters <- data.frame(matrix(nrow = nrow(data),
                                  ncol = length(list(...)))) # set up dataframe for clusters
    for(i in list(...)) {
        kmeans <- kmeans(data[,i],centers = k)
        clusters[,i] <- kmeans$cluster
    }
    colnames(clusters) <- list(...)
    clusters
}

My question is: this seems to work when I only ask it to use consecutive columns, but not when I ask it to skip around some. For instance, the first of the following works, but the second does not. Any idea how I can fix this?

# works how I want 
head(my.kmeans(data = mydata, k = 8, c(1,2), c(2,3), c(1,2,3)))

# doesn't work 
head(my.kmeans(data = mydata, k = 8, c(1,2), c(2,3), c(1,2,5)))

Also, I know people recommend using apply functions and staying away from for loops, but I don't know how to do this with an apply function. Any advice on that would be much appreciated as well.

Thanks so much!

Danny

the problem is in this part of the code clusters[,i] <- kmeans$cluster because i resolves to 5 in your second case — SatZ
Thanks so much @SatZ! Could you explain why i resolves to 5? And how I might get around this? Sorry--I'm pretty new to R. Thanks a lot! — Danny
For anyone who's following (though this is pretty specific so I doubt it), I think I figured it out: you have to change "for(i in list(...))" to "for(i in 1:length(list(...)))"; that way i is the position of each column set, so when you subset with i later, it fills in the right column of the output. Thanks @SatZ — Danny
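
To spell out the comments above, here is a minimal sketch of why the original loop fails, assuming a small placeholder data frame shaped like `clusters` in the question (3 columns; the 4 rows and the value `1:4` are made up for illustration). Looping with `for(i in list(...))` hands `i` each column set itself, e.g. `c(1, 2, 5)`, not its position in the list. `clusters[, i] <- ...` then tries to write columns 1, 2 and 5 of a frame that only has 3, which would create a column 5 without a column 4. `c(1, 2, 3)` only appears to work because all three columns already exist.

```r
# Placeholder frame shaped like `clusters` in the question (assumed sizes)
clusters <- data.frame(matrix(nrow = 4, ncol = 3))

# Looping over a list yields the elements themselves, not their positions
seen <- list()
for (i in list(c(1, 2), c(2, 3), c(1, 2, 5))) {
  seen[[length(seen) + 1]] <- i
}
print(seen[[3]])  # the whole vector c(1, 2, 5), not a single index

# Writing to columns 1, 2 and 5 would add column 5 but skip column 4,
# so the data frame assignment method refuses
msg <- tryCatch({
  clusters[, c(1, 2, 5)] <- 1:4
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Indexing by position stays within the 3 existing columns
for (i in seq_along(seen)) {
  clusters[, i] <- 1:4
}
print(dim(clusters))
```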

Danny Danny · Accepted Answer · 2018-07-11T19:31:31

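Building on @SatZ's comments, the fix is to pass the column sets as a list and loop over the positions in that list rather than over its elements: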

set.seed(123)
mydata <- data.frame(a = rnorm(100,0,1), b = rnorm(100,0,1), c = rnorm(100,0,1),
                     d = rnorm(100,0,1), e = rnorm(100,0,1))
mylist <- list(c(1,2), c(2,3), c(1,2,5))

set.seed(123)
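my.kmeans <- function(data,k,list) {
  # set up dataframe for clusters: one column per set of columns in `list`
  clusters <- data.frame(matrix(nrow = nrow(data), ncol = length(list)))
  for(i in seq_along(list)) { # i is the position in `list`, not the column set
      fit <- kmeans(data[,list[[i]]],centers = k) # cluster on the i-th column set
      clusters[,i] <- fit$cluster # store memberships in output column i
  }
  colnames(clusters) <- list # labels each column with its column set, e.g. "c(1, 2)"
  clusters
}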

head(my.kmeans(data = mydata, k = 8, list = mylist))
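
On the apply part of the question: one possible way to drop the explicit loop is `sapply`, which calls a function once per column set and binds the equal-length results into a matrix. This is only a sketch, not a tested replacement; the names `my.kmeans.sapply` and `col_sets` are made up for illustration, and the column labels are built with `paste` as an arbitrary choice.

```r
# Hypothetical sapply-based variant: one column of cluster labels per column set
my.kmeans.sapply <- function(data, k, col_sets) {
  # Each call returns a vector of nrow(data) labels; sapply binds them column-wise
  clusters <- sapply(col_sets, function(cols) {
    kmeans(data[, cols, drop = FALSE], centers = k)$cluster
  })
  clusters <- as.data.frame(clusters)
  # Label each output column with the column indices it was built from
  colnames(clusters) <- sapply(col_sets, paste, collapse = ",")
  clusters
}

set.seed(123)
mydata <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100),
                     d = rnorm(100), e = rnorm(100))
mylist <- list(c(1, 2), c(2, 3), c(1, 2, 5))

res <- my.kmeans.sapply(mydata, k = 8, col_sets = mylist)
head(res)
```

Because the function never indexes the output by the column numbers themselves, non-consecutive sets such as `c(1, 2, 5)` are handled the same way as consecutive ones.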
