
I am generating species distribution models using Random Forest. These models attempt to predict the probability of occurrence of a species, conditioned on various environmental attributes. For most species, our initial set of potential predictors is somewhere between 10 and 25, and each predictor is represented by a GIS raster file with 460,000,000 cells. Because of the nature of the training data, which I won't go into here, I am actually building multiple RF models (approximately 10 to 100+) based on subsets of the data, and then combining them to create my overall model for each species. Actually building the models takes relatively little time (a few minutes or less, generally), but using the predict function to produce a raster layer of predicted probabilities from a model can take 20+ hours. I suspect that much of this lengthy process is due to reading/writing the large raster files, and that the bottleneck might be hard drive read/write speed.

To provide a little more detail... Once I have my trained model, I am creating a raster stack of the predictor layers via the raster package, and then predicting to that stack with the raster package's predict() function (sketched below). I have a reasonably powerful desktop (Core i7, 3.5 GHz, with 32 GB of RAM), and the input and output raster files are on the local hard drive, not moving over a network. I saw mbq's answer here with helpful suggestions on speeding up model generation with randomForest, and am looking for similar suggestions for speeding up the predict operation. I can think of a number of things that might help (e.g., growing a smaller number of trees, using one of the libraries for parallel processing), and I plan to test these as time permits, but it's unclear to me whether any of them will have a significant impact if the problem is mostly a read/write bottleneck. I would be grateful for any suggestions.
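For reference, here is a minimal sketch of my current workflow, along with one parallel variant (raster::clusterR) of the sort I plan to test. The file names and the fitted model object (rf_model) are placeholders, not my actual data:

    library(raster)
    library(randomForest)

    # Placeholder paths; each .tif is one predictor layer
    predictors <- stack(list.files("predictors", pattern = "\\.tif$", full.names = TRUE))

    # Single-core prediction, written straight to disk as it proceeds
    prob <- predict(predictors, rf_model, type = "prob", index = 2,
                    filename = "occurrence_prob.tif", progress = "text")

    # Parallel variant: clusterR() runs predict block-by-block across cores
    beginCluster(4)
    prob_par <- clusterR(predictors, predict,
                         args = list(model = rf_model, type = "prob", index = 2),
                         filename = "occurrence_prob_par.tif")
    endCluster()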

I'll just offer an anecdote that doing math on rasters in this way is VERY slow, so that may indeed be your issue. – blindjesse
I did run across the posting about parallelRandomForest. It only appears to support regression, not classification. Also, it's not clear to me whether this will really speed things up with prediction, as the emphasis appears to be on training. There may still be a bottleneck with read/write speeds. – user13706

1 Answer


You might look at the mctune function here. It builds on tune() from the e1071 package to search for optimal model parameters in parallel, and you might be able to tweak it to meet your needs. For example:

    library(randomForest)
    library(e1071)        # mctune builds on e1071's tune()
    source('./mctune.R')

    # Grid of candidate hyperparameters to evaluate
    rf_ranges <- list(ntree = c(seq(1, 1000, 100), seq(1000, 8000, 500)),
                      mtry  = seq(5, 15, 2))

    set.seed(10)
    tuned.rf <- mctune(method = randomForest, train.x = formula1,
                       data = dataframe,
                       tunecontrol = tune.control(sampling = "cross", cross = 5),
                       ranges = rf_ranges,
                       mc.control = list(mc.cores = 16, mc.preschedule = TRUE),
                       confusionmatrizes = TRUE)
    save(tuned.rf, file = './tuned_rf.RData')

    tuned.rf$best.model   # best model found over the grid
    plot(tuned.rf)        # visualize performance across the grid

Another option might be to use foreach with the doParallel backend (see here). You could assign each subset of data (for a new RF model) to its own core:

    library(foreach)
    library(doParallel)
    registerDoParallel(cores = 16)   # register a backend before using %dopar%
    RF_outputs <- foreach(i = seq_along(yourdatasubsets), .inorder = FALSE,
                          .packages = "randomForest") %dopar% {
      set.seed(10)   # note: every worker gets the same seed
      randomForest(formula, data = na.omit(yourdatasubsets[[i]]),
                   ntree = 2000, proximity = TRUE)
    }

Each trained RF model will be returned to you as an element of the list RF_outputs, so RF_outputs[[1]] would be your first trained RF model.
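If you then want a single ensemble per species, one possible approach (a sketch, assuming each element of RF_outputs is a randomForest object built with compatible settings) is randomForest's combine():

    # Merge the per-subset forests into one ensemble (hypothetical usage)
    library(randomForest)
    combined_rf <- do.call(randomForest::combine, RF_outputs)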