
I am trying to run a random forest regression on a large dataset in R using the randomForest package. I've run into problems with the computational time required, even when parallelized with doSNOW and 10-20 cores. I think I am misunderstanding the "sampsize" parameter of the randomForest function. When I subset the dataset to 100,000 rows, I can build 1 tree in 9-10 seconds.

library(randomForest)
library(dplyr)  # for sample_n

training = read.csv("training.csv")
t100K = sample_n(training, 100000)  # subset to 100,000 rows up front
system.time(randomForest(tree~., data=t100K, ntree=1, importance=T)) #~10sec

But, when I use the sampsize parameter to sample 100,000 rows from the full dataset in the course of running randomForest, the same 1 tree takes hours.

system.time(randomForest(tree~., data=training, sampsize = ifelse(nrow(training<100000),nrow(training), 100000), ntree=1, importance=T)) #>>100x as long. Why?

Obviously, I am eventually going to run >>1 tree. What am I missing here? Thanks.


1 Answer


Your brackets are slightly off. Notice the difference between the following statements. You currently have:

ifelse(nrow(mtcars<10),nrow(mtcars), 10)

which takes the number of rows of the logical matrix mtcars < 10 (TRUE wherever an element of mtcars is smaller than 10, FALSE otherwise). That matrix has the same number of rows as mtcars itself, so the test is a nonzero row count, which ifelse coerces to TRUE. In your call, sampsize therefore silently becomes nrow(training), i.e. the entire dataset, which is why that single tree takes hours. You want:

ifelse(nrow(mtcars)<10,nrow(mtcars), 10)
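As a quick sanity check you can run both versions on the built-in mtcars data frame (32 rows) and see how the misplaced parenthesis defeats the cap:

```r
# mtcars < 10 is a logical matrix with the same dimensions as mtcars,
# so nrow() of it is just nrow(mtcars), never the intended comparison.
nrow(mtcars < 10)   # 32 -- row count of the logical matrix
nrow(mtcars) < 10   # FALSE -- the intended comparison

# ifelse() coerces the nonzero row count to TRUE, so the buggy version
# always returns the full row count instead of capping at 10:
ifelse(nrow(mtcars < 10), nrow(mtcars), 10)   # 32 (bug: no cap)
ifelse(nrow(mtcars) < 10, nrow(mtcars), 10)   # 10 (correct: capped)
```

The same thing happened with your training data: sampsize was nrow(training), so every tree was grown on a bootstrap sample the size of the whole dataset.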

Hope this helps.