6 votes

I am working on a binary classification random forest with approximately 4500 variables. Many of these variables are highly correlated, and some of them are just quantiles of an original variable. I am not quite sure whether it would be wise to apply PCA for dimensionality reduction. Would this increase the model performance?

I would like to be able to know which variables are more significant to my model, but if I use PCA, I would only be able to tell which PCs are more important.

Many thanks in advance.

Comment (2 votes): For the interested reader, this may be useful too as it gives a different perspective: stats.stackexchange.com/questions/258938/… (Anonymous)

2 Answers

6 votes

My experience is that PCA before RF is not a great advantage, if any. Principal component regression (PCR) is a case where PCA helps to regularize the training features before OLS linear regression, and that is very much needed for sparse data sets. RF itself already performs a good/fair regularization without assuming linearity, so PCA is not necessarily an advantage. That said, I found myself writing a PCA-RF wrapper for R two weeks ago. The code includes a simulated data set of 100 features comprising only 5 true linear components. Under such circumstances it is in fact a small advantage to pre-filter with PCA. The code is a seamless implementation, such that all RF parameters are simply passed on to randomForest. The loading vectors are saved in the model fit to use during prediction.
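
For contrast, PCR in its simplest form is just OLS on the leading PC scores. A minimal sketch (the name pcr_fit is only for illustration):

#minimal PCR sketch: rotate with PCA, then OLS on the first k score columns
pcr_fit = function(x, y, k = 5) {
  pca = princomp(x)                                   #centers x, no scaling by default
  lm(y ~ ., data = data.frame(y = y, pca$scores[, 1:k]))
}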

> I would like to be able to know which variables are more significant to my model, but if I use PCA, I would only be able to tell what PCs are more important.

The easy way is to run once without PCA, obtain the variable importances, and expect to find something similar for the PCA-RF.
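
For example, with a plain randomForest fit the importances come straight out of the package (standard randomForest functions, nothing to do with the wrapper below):

#assuming x, y are your training data
rf_plain = randomForest(x, y, importance = TRUE, ntree = 500)
importance(rf_plain)   #per-variable importance matrix (columns differ for classification/regression)
varImpPlot(rf_plain)   #plot the most important variables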

The tedious way is to wrap the PCA-RF in a new bagging scheme with your own variable importance code; it could be done in 50-100 lines or so. (A rough permutation-importance sketch is given after the run example below.)

The source-code suggestion for PCA-RF:

#wrap PCA around randomForest, forward any other arguments to randomForest
#define as new S3 model class
train_PCA_RF = function(x,y,ncomp=5,...) {
  f.args=as.list(match.call()[-1])
  pca_obj = princomp(x)
  rf_obj = do.call(randomForest,c(alist(x=pca_obj$scores[,1:ncomp]),f.args[-1]))
  out=mget(ls())
  class(out) = "PCA_RF"
  return(out)    
}

#print method
print.PCA_RF = function(object) print(object$rf_obj)

#predict method
predict.PCA_RF = function(object,Xtest=NULL,...) {
  print("predicting PCA_RF")
  f.args=as.list(match.call()[-1])
  if(is.null(f.args$Xtest)) stop("cannot predict without Xtest (new data) parameter")
  sXtest = predict(object$pca_obj,Xtest) #center/rotate Xtest exactly as Xtrain was
  return(do.call(predict,c(alist(object = object$rf_obj, #class(x)="randomForest" invokes method predict.randomForest
                                 newdata = sXtest),      #newdata input, see help(predict.randomForest)
                                 f.args[-1:-2])))  #any other parameters are passed to predict.randomForest

}

#simulate train/test data with latent components#
make.component.data = function(
  inter.component.variance = .9,
  n.real.components = 5,
  nVar.per.component = 20,
  nObs=600,
  noise.factor=.2,
  hidden.function = function(x) apply(x,1,mean),
  plot_PCA =T
){
  Sigma=matrix(inter.component.variance,
               ncol=nVar.per.component,
               nrow=nVar.per.component)
  diag(Sigma)  = 1
  x = do.call(cbind,replicate(n = n.real.components,
                              expr = {mvrnorm(n=nObs,
                                              mu=rep(0,nVar.per.component),
                                              Sigma=Sigma)},
                              simplify = FALSE)
            )
  if(plot_PCA) plot(prcomp(x,center=TRUE,scale.=TRUE))
  y = hidden.function(x)
  ynoised = y + rnorm(nObs,sd=sd(y)) * noise.factor
  out = list(x=x,y=ynoised)
  pars = ls()[!ls() %in% c("x","y","Sigma")]
  attr(out,"pars") = mget(pars) #attach all pars as attributes
  return(out)
}

An example run:

#start script------------------------------
#source above from separate script
#test
library(MASS)
library(randomForest)

Data = make.component.data(nObs=600)#plots PC variance
train = list(x=Data$x[  1:300,],y=Data$y[1:300])
test = list(x=Data$x[301:600,],y=Data$y[301:600])

rf  = randomForest(train$x, train$y, ntree = 50) #regular RF
rf2 = train_PCA_RF(train$x, train$y, ntree = 50, ncomp = 12)

rf
rf2


cat("rf, R^2:",cor(test$y,pred_rf  )^2,"PCA_RF, R^2", cor(test$y,pred_rf2)^2)

pred_rf = predict(rf  ,test$x)
pred_rf2 = predict(rf2,test$x)

cor(test$y,predict(rf ,test$x))^2
cor(test$y,predict(rf2,test$x))^2

pairs(list(trueY = test$y,
           native_rf = pred_rf,
           PCA_RF = pred_rf2)
)
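
And a rough sketch of the permutation importance mentioned above, measured straight through the PCA-RF wrapper on the original variables (an illustration only, not the full bagging scheme; perm_importance is just a name chosen here, and it reuses rf2 and test from the run above):

#permutation importance through the PCA_RF wrapper:
#shuffle one original column at a time and record the drop in test R^2
perm_importance = function(fit, x, y) {
  base = cor(y, predict(fit, x))^2
  sapply(seq_len(ncol(x)), function(j) {
    xp = x
    xp[, j] = sample(xp[, j])              #permute variable j only
    base - cor(y, predict(fit, xp))^2      #importance = loss of R^2
  })
}

imp = perm_importance(rf2, test$x, test$y)
head(order(imp, decreasing = TRUE), 10)    #indices of the 10 most influential variables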
2 votes

You can have a look here to get a better idea. The link says to use PCA for smaller datasets! Some of my colleagues have used Random Forests for the same purpose when working with genomes. They had ~30000 variables and a large amount of RAM.

Another thing I found is that Random Forests use up a lot of memory, and you have 4500 variables. So maybe you could apply PCA to the individual trees.
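
As a very rough illustration of that idea, fitting one PCA per bootstrap sample and one tree on its leading scores (a minimal sketch only; it assumes the rpart package and a regression target, and the function names are made up here):

library(rpart)

#one PCA rotation per bootstrap sample, one tree per rotation
pca_tree_bag = function(x, y, ntree = 50, ncomp = 10) {
  lapply(seq_len(ntree), function(i) {
    idx = sample(nrow(x), replace = TRUE)                 #bootstrap sample
    pca = prcomp(x[idx, ], center = TRUE, scale. = TRUE)  #PCA on this sample only
    dat = data.frame(y = y[idx], pca$x[, 1:ncomp])        #leading scores as features
    list(pca = pca, fit = rpart(y ~ ., data = dat))
  })
}

#average the trees' predictions (regression case)
predict_pca_tree_bag = function(models, newx) {
  rowMeans(sapply(models, function(m)
    predict(m$fit, data.frame(predict(m$pca, newx)))))
}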