To alter the sampling of randomForest(type="regression") directly: learn basic C programming, download the source package randomForest_4.6-10.tar.gz from CRAN (on Windows install Rtools, on OS X install Xcode), install and open RStudio, start a new project, choose package, unpack the ...tar.gz into the project folder, look in the src folder, open regrf.c and check out lines 151 and 163. Write the new sampling strategy, occasionally press Ctrl+Shift+B to rebuild/compile the package and overwrite the installed randomForest library, correct the reported compile errors, and occasionally test that the package still works. Spend some hours figuring out the old, uninformative code, perhaps change the DESCRIPTION file, the NAMESPACE file and a few other references so the package is renamed randomForestMod, rebuild, voilà.
An easier way that leaves randomForest unchanged is described below. Any forests trained on the same input features can be patched together with the function randomForest::combine, so you can design your sampling regime in pure R code. I actually thought it was a bad idea, but for this very naive simulation it works with similar or slightly better performance! Remember not to predict the absolute target value, but instead a stationary derivative such as the relative change, absolute change, etc. If predicting the absolute value, RF will fall back to predicting that tomorrow is something pretty close to today, which is trivial, useless information.
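As a minimal sketch of what "a stationary derivative" can look like, the snippet below turns a small made-up series into absolute and relative changes and then rebuilds the levels from them. The toy values and the variable names (absChange, relChange, levelFromAbs, levelFromRel) are illustrative assumptions, not part of the simulation further down; the true changes stand in for model predictions.

```r
# Toy series (hypothetical values), past to present orientation
y = c(100, 102, 101, 105, 110)

# Stationary targets with lag = 1
absChange = diff(y)                   # y[t] - y[t-1]
relChange = diff(y) / y[-length(y)]   # (y[t] - y[t-1]) / y[t-1]

# A model would be trained to predict the change, and the level is
# recovered afterwards. Here the true changes are used in place of predictions.
levelFromAbs = y[1] + cumsum(absChange)
levelFromRel = y[1] * cumprod(1 + relChange)

# Both reconstructions should reproduce y[2], ..., y[5]
print(all.equal(levelFromAbs, y[-1]))
print(all.equal(levelFromRel, y[-1]))
```

For a one-step-ahead forecast, the predicted change would be added to (or multiplied onto) the last observed level rather than accumulated over the whole series.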
edited code [22:42 CEST]
library(randomForest)
library(doParallel) #parallel package and mclapply is better for linux
#parallel backend ftw
nCPU = detectCores()
cl = makeCluster(nCPU)
registerDoParallel(cl)
#simulated time series(y) with time roll and lag=1
timepoints=1000;var=6;noise.factor=.2
#past to present orientation
y = sin((1:timepoints)*pi/30) * 1000 +
sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
y = y+rnorm(timepoints,sd=sd(y))*noise.factor
plot(y,type="l")
#convert to absolute change, with lag=1
dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
#compute lags 1 to 40 of dy as predictors
dX = sapply(1:40,function(i){
getTheseLags = (1:timepoints) - i
getTheseLags[getTheseLags<1] = NA #remove before start timePoints
dx.lag.i = dy[getTheseLags]
})
dX[is.na(dX)]=-100 #quick fix for when the lag reaches back before the start of the time series
pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure
#make train- and test-set
train=1:600
dy.train = dy[ train]
dy.test = dy[-train]
dX.train = dX[ train,]
dX.test = dX[-train,]
#classic rf
rf = randomForest(dX.train,dy.train,ntree=500)
print(rf)
#like function split for a vector without mixing
split2 = function(aVector,splits=31) {
lVector = length(aVector)
mod = lVector %% splits
lBlocks = rep(floor(lVector/splits),splits)
if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
lapply(1:splits,function(i) {
Stop = sum(lBlocks[1:i])
Start = Stop - lBlocks[i] + 1
aVector[Start:Stop]
})
}
nBlocks=10 #combine does not support blocks of unequal size
rfBlocks = foreach(aBlock = split2(train,splits=nBlocks),
.combine=randomForest::combine,
.packages=("randomForest")) %dopar% {
dXblock = dX.train[aBlock,] ; dyblock = dy.train[aBlock]
rf = randomForest(x=dXblock,y=dyblock,sampsize=length(dyblock),
replace=TRUE,ntree=50)
}
print(rfBlocks)
#predict test, make results table
results = data.frame(predBlock = predict(rfBlocks,newdata=dX.test),
true=dy.test,
predBootstrap = predict(rf,newdata=dX.test))
#black: block-trained forest, blue: classic bootstrap forest (both on the test set)
plot(results[,1:2],xlab="test-set predicted change",
ylab="trueChange",
main="black block train and blue bootstrap")
points(results[,3:2],col="blue")
#prediction results
print(cor(results)^2)
stopCluster(cl)#close cluster
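The split2 helper above cuts an index vector into contiguous blocks without shuffling. As a hedged alternative sketch, the same partition can be written with base R's split and rep. The function name splitContiguous is a hypothetical one introduced here for illustration, and it follows the same convention as split2 of giving the first few blocks one extra element when the length is not divisible by the number of blocks.

```r
# Hypothetical helper: contiguous blocks in original order, no mixing
splitContiguous = function(aVector, splits = 10) {
  n = length(aVector)
  # block sizes: the first (n %% splits) blocks get one extra element
  sizes = rep(n %/% splits, splits) + (seq_len(splits) <= n %% splits)
  # label every element with its block number, then split by label
  unname(split(aVector, rep(seq_len(splits), times = sizes)))
}

blocks = splitContiguous(1:600, splits = 10)
print(lengths(blocks))            # size of each block

uneven = splitContiguous(1:7, splits = 3)
print(uneven)                     # blocks of length 3, 2 and 2
```

With 600 training rows and 10 blocks every block has the same size, which matters here because the block forests are merged with randomForest::combine.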
[...] randomForest <- randomForest:::randomForest.formula. How big is the data? Maybe you could remove the time series element by including the lagged values as independent variables, then using a standard approach? – RichAtMango
[...] strata argument will achieve what you want, probably not. Otherwise, the resampling all takes place in compiled C (regression) or Fortran (classification) code, so to modify that you'd need to download the source, alter it and recompile. – joran
[...] replace==FALSE and sampsize equal to your training data? That would get rid of the bootstrap entirely and you'd basically end up with a set of bagged trees. – Tchotchke