I am trying to run a random forest regression on a large dataset in R using the randomForest package. I've run into problems with the computational time required, even when parallelizing with doSNOW and 10-20 cores. I think I am misunderstanding the "sampsize" parameter of randomForest: when I subset the dataset to 100,000 rows, I can build 1 tree in 9-10 seconds.
library(randomForest)
library(dplyr)  # for sample_n

training = read.csv("training.csv")
t100K = sample_n(training, 100000)
system.time(randomForest(tree~., data=t100K, ntree=1, importance=T)) #~10sec
But when I use the sampsize parameter to have randomForest itself sample 100,000 rows from the full dataset, the same single tree takes hours.
system.time(randomForest(tree~., data=training, sampsize = ifelse(nrow(training<100000),nrow(training), 100000), ntree=1, importance=T)) #>>100x as long. Why?
Obviously, I am eventually going to grow many more than 1 tree. What am I missing here? Thanks.
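For context, here is a sketch of the doSNOW/foreach setup I use for the parallelized runs mentioned above (the worker count and trees-per-worker are illustrative placeholders, not my exact settings):

```r
# Illustrative sketch of the parallel setup: grow trees on several
# workers and merge the forests with randomForest::combine.
library(doSNOW)
library(foreach)
library(randomForest)

cl <- makeCluster(10)         # placeholder: I use 10-20 cores
registerDoSNOW(cl)
rf <- foreach(ntree = rep(50, 10), .combine = randomForest::combine,
              .packages = "randomForest") %dopar%
  randomForest(tree ~ ., data = training, ntree = ntree, importance = TRUE)
stopCluster(cl)
```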