1
votes

I'm trying to do feature selection on a list of data frames using the caret package. The data frames differ, but their last 6 columns are the same. When I apply the model to a single data frame, it works fine:

# For a single dataframe
mx.chem # the name of my single dataframe
#define the control   
mx.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10) 
# run the rfe     
mx.results <- rfe(mx.chem[,1:22], mx.chem[,23], sizes = c(1:22), rfeControl = mx.control)
print(mx.results)

My problem starts when I try to use lapply on a list of data frames. The code I have so far is:

 require(mlbench)
 require(caret)
 mylist # is a df list containing 3 df 
  for (i in 1:3) {
  my.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10)  # define the control
  longdata <- length(i)-6
  idxindustry <- longdata +1
  my.results <- lapply(mylist, function(x) rfe ( x[,1:longdata], x[,idxindustry], data = x, sizes =c(1:longdata), rfeControl = my.control))
  }

I'm not sure if I'm using the column indexes properly. Does anyone have an idea how to fix my code so that it works? Thanks

2
There does not appear to be a data argument for rfe. Link here. It is currently being passed to the model fitting. Is that intended? - Pierre L
I've just edited my question. My problem is when I want to use lapply. I do not know how to specify a column of a data frame contained in a list - mina
longdata <- length(i)-6 is not doing what you think. i takes on three values 1 2 3. So you want longdata to be length(1)-6 and so on? The length of a single number is always 1. So longdata is -5 each time. Do you see why? - Pierre L
Please explain the for loop that is looping through 1 2 3 - Pierre L
Use a for loop or lapply, not both - Pierre L

2 Answers

2
votes

Here are two possible ways:

#Using lapply
mx.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10) 
rfe.lst <- lapply(mylist, 
           function(x) {
               longdata <- ncol(x)-6
               rfe ( x[,1:longdata], x[,longdata + 1], 
                         sizes =c(1:longdata), 
                         rfeControl = mx.control)
               })

#For loop
mx.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10) 
rfe.lst <- vector("list", 3)
for(i in 1:3) {
  longdata <- ncol(mylist[[i]])-6
  rfe.lst[[i]] <- rfe(mylist[[i]][,1:longdata], mylist[[i]][,longdata + 1],
      sizes=c(1:longdata),
      rfeControl=mx.control)
}
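
The loop above hard-codes 3 as the list length. A minimal sketch of the same pattern with seq_along(), so it adapts to any list length, is shown here. The names demo_list and fake_fit are made up for illustration; fake_fit is only a stand-in for rfe() so the sketch runs without caret. It also uses drop = FALSE, since in R selecting a single column with x[, 1:1] returns a vector rather than a data frame.

```r
# Toy stand-in for rfe(): just records the dimensions it was given.
# This is a placeholder for illustration, not a caret function.
fake_fit <- function(x, y) list(n_predictors = ncol(x), n_obs = length(y))

# Illustrative data: leading predictor columns followed by 6 trailing columns
demo_list <- list(
  a = data.frame(p1 = 1:4, p2 = 1:4, matrix(0, 4, 6)),
  b = data.frame(p1 = 1:4, matrix(0, 4, 6))
)

res <- vector("list", length(demo_list))
names(res) <- names(demo_list)
for (i in seq_along(demo_list)) {
  df <- demo_list[[i]]
  longdata <- ncol(df) - 6
  # drop = FALSE keeps a data frame even when longdata is 1
  res[[i]] <- fake_fit(df[, seq_len(longdata), drop = FALSE],
                       df[, longdata + 1])
}
str(res)
```

In the real code, fake_fit would be replaced by the rfe() call with sizes and rfeControl as above.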
0
votes

Your code doesn't do what you think. length(i) will always be 1, because i is your loop index and takes the values 1 to 3. You mean to do:

length(mylist[[i]])

Note the double brackets. That's how you select a single element from the list, in this case the data frame itself. If you use single brackets, you get back a list containing the selected elements, not the data frame.
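
To make the difference concrete, here is a small sketch on toy data. The list mylist_demo and its contents are invented for illustration and are not the data from the question.

```r
# Toy list of data frames; names and contents are illustrative only
mylist_demo <- list(
  a = data.frame(x1 = 1:3, x2 = 4:6, y = 7:9),
  b = data.frame(x1 = 1:3, y = 4:6)
)

i <- 1
len_index <- length(i)               # length of the loop index itself
single <- mylist_demo[i]             # single brackets: still a list
double <- mylist_demo[[i]]           # double brackets: the data frame
len_df <- length(mylist_demo[[i]])   # number of columns, same as ncol()

print(len_index)       # 1
print(class(single))   # "list"
print(class(double))   # "data.frame"
print(len_df)          # 3
```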

But that's still not what you aim to achieve. If you changed only that line in your code, you would still have two nested loops:

  • an outer loop that creates longdata and idxindustry based on a single data frame each time.
  • an inner lapply loop that uses the values for longdata and idxindustry on all three dataframes.

Remember that lapply takes each element in the list and passes it as the first argument to the function you specify. So you can do this in a single lapply like this:

my.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10)  

my.results <- lapply(mylist, function(x){
# x becomes one of the data frames in the list mylist here, so you can
# treat it like a data frame in the code below
  longdata <- length(x) - 6
  idxindustry <- longdata +1
  rfe( x[,1:longdata], x[,idxindustry],
      sizes =c(1:longdata), rfeControl = my.control)
})

And then you run rfe with longdata and idxindustry based on the data frame at hand. Note I put the call to rfeControl outside the lapply loop for performance.
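
Since rfe can take a while, it may be worth checking the column split on each data frame first. The following is only a sketch: split_columns, make_df and demo_list are hypothetical names, and it assumes the layout described in the question (the outcome is the first of the last 6 columns).

```r
# Hypothetical helper (not part of caret): report which columns would be
# used as predictors and which as the outcome for one data frame.
split_columns <- function(x, n_trailing = 6) {
  longdata <- ncol(x) - n_trailing
  stopifnot(longdata >= 1)
  list(predictors = names(x)[seq_len(longdata)],
       outcome = names(x)[longdata + 1])
}

# Toy data frames with different numbers of leading columns
make_df <- function(n_pred) {
  cols <- c(paste0("p", seq_len(n_pred)), paste0("t", 1:6))
  as.data.frame(matrix(0, nrow = 2, ncol = length(cols),
                       dimnames = list(NULL, cols)))
}
demo_list <- list(first = make_df(2), second = make_df(4))

# lapply hands each data frame to split_columns as its first argument
splits <- lapply(demo_list, split_columns)
str(splits)
```

If the reported predictors and outcome look right for every element, the same indexes can be used in the rfe call.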