How do I replace the bootstrap step in the package randomForest r

Question

First some background info, which is probably more interesting on stats.stackexchange:

In my data analysis I try to compare the performance of different machine learning methods on time series data (regression, not classification). So for example I have trained a Boosting trained model and compare this with a Random Forest trained model (R package randomForest).

I use time series data where the explanatory variables are lagged values of other data and the dependent variable.

For some reason the Random Forest severely underperforms. One of the problems I could think of is that the Random Forest performs a sampling step of the training data for each tree. If it does this to time series data, the autoregressive nature of the series is completely removed.

To test this idea, I would like to replace the (bootstrap) sampling step in the randomForest() function with a so called block-wise bootstrap step. This basically means I cut the training set into k parts, where k<<N, where each k-th part is in the original order. If I sample these k parts, I could still benefit from the 'randomness' in the Random Forest, but with the time series nature left largely intact.

Now my problem is this:

To achieve this I would normally copy the existing function and edit the desired step/lines.

randomForest2 <- randomForest()

But the randomForest() function seems to be a wrapper for another wrapper for deeper underlying functions. So how can I edit the actual bootstrap step in the randomForest() function and still run the rest of the function regularly?

you may want to go grab the original function and start from there - something like randomForest <- randomForest:::randomForest.formula. How big is the data? Maybe you could remove the time series element bit by including the lagged values as independant variables, then using a standard approach? — RichAtMango
I'm not sure the strata argument will achieve what you want, probably not. Otherwise, the resampling all takes place in compiled C (regression) or Fortran (classification) code, so to modify that you'd need to download the source, alter it and recompile. — joran
What if you tried it with replace==FALSE and sampsize equal to your training data? That would get rid of the bootstrap entirely and you'd basically end up with a set of bagged trees. — Tchotchke
Thank you for your suggestions, I will check them out tomorrow and update accordingly. — DaReal

DaReal DaReal · Accepted Answer · 2015-08-22T18:36:53

So for me the solution wasn't editing the existing randomForest function. Instead I coded the block-wise bootstrap myself, using the split2 function given by Soren H. Welling to create the blocks. Once I had my data block-wise bootstrapped, I looked for a package (rpart) that performed just a single Regression Tree and aggregated it myself (taking the means).

The result for my actual data is a slightly but consistently improved version over the normal random forest performance in terms of RMSPE.

For the code below the performance seems to be a coin-toss.

Taking Soren's code as an example it looks a bit like this:

library(randomForest)
library(doParallel) #parallel package and mclapply is better for linux
library(rpart)

#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)

#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2

#past to present orientation    
y = sin((1:timepoints)*pi/30) * 1000 +
  sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")

#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)

#compute lag 
dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise
dy = c(0,y[-1]-y[-length(y)]) #convert to absolute change, with lag=1 
dX = sapply(1:40,function(i){
  getTheseLags = (1:timepoints) - i
  getTheseLags[getTheseLags<1] = NA #remove before start timePoints
  dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix of when lag exceed timeseries
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure

#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test  = dy[-train]
dX.train  = dX[ train,]
dX.test   = dX[-train,]

#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)

#like function split for a vector without mixing
split2 = function(aVector,splits=31) {
  lVector = length(aVector)
  mod = lVector %% splits
  lBlocks = rep(floor(lVector/splits),splits)
  if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
  lapply(1:splits,function(i) {
    Stop  = sum(lBlocks[1:i])
    Start = Stop - lBlocks[i] + 1
    aVector[Start:Stop]
  })
}  


#create a list of block-wise bootstrapped samples
aBlock <- list()
numTrees <- 500
splits <- 40
for (ttt in 1:numTrees){

  aBlock[[ttt]] <- unlist(
    sample(
      split2(1:nrow(dX.train),splits=splits),
      splits,
      replace=T
    )
  )
}

#put data into a dataframe so rpart understands it
df1 <- data.frame(dy.train, dX.train)
#perform regression trees for Blocks
rfBlocks = foreach(aBlock = aBlock,
                   .packages=("rpart")) %dopar% {
                     dBlock = df1[aBlock,] 
                     rf = predict( rpart( dy.train ~., data = dBlock, method ="anova" ), newdata=data.frame(dX.test) ) 
                   } 

#predict test, make results table
#use rowMeans to aggregate the block-wise predictions
results = data.frame(predBlock   = rowMeans(do.call(cbind.data.frame, rfBlocks)),
                     true=dy.test,
                     predBootstrap = predict(rf,newdata=dX.test)
                     )
plot(results[,1:2],xlab="OOB-CV predicted change",
     ylab="trueChange",
     main="black bootstrap and blue block train")
points(results[,3:2],xlab="OOB-CV predicted change",
       ylab="trueChange",
       col="blue")

#prediction results
print(cor(results)^2)


stopCluster(cl)#close cluster

How do I replace the bootstrap step in the package randomForest r

2 Answers