
I use the randomForest package in R with a rolling window to predict returns on a financial time series (a stock). I have developed a basket of features for this purpose and my goal is to understand their relative predictive power.

My challenge is that I cannot use random forest's built-in variable importance, because most of my features are highly correlated with their own recent past. For example, a moving average spans a window of several days, so it carries information across several neighbouring observations in my data set.

This implies that the out-of-bag samples generated by the random forest will be correlated with the in-sample observations used to train my model. The variable importance computed from them would therefore be overly optimistic and effectively overfitted.

The solution I see is to compute variable importance on a held-out, out-of-sample test set rather than via OOB cross-validation, so that there is no correlation with the training set.

My question: is there an R package that computes and extracts variable importance from a test set rather than from the standard OOB samples? If not, can you suggest an approach to achieve this? Thank you for your help.


1 Answer


Wow, great that you thought of these correlations with the in-sample features.

Indeed, computing your variable importance on a test set is a better idea. I think you can easily program this yourself. Here's some pseudo-code:

baseline = performance metric on the untouched test set
for variable in variables:
    randomly shuffle that variable's column in the test set
    metric = performance metric on the shuffled test set
    importance[variable] = baseline - metric
    restore the original column
barplot(importance)

Your performance metric can be anything: MSE (natural for predicting returns), cross-entropy, accuracy, ROC-AUC, and so on.
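For concreteness, here is a minimal R sketch of the permutation idea above, using the randomForest package. The simulated `train`/`test` data frames and the response column `y` are hypothetical placeholders; substitute your own rolling-window split and features:

```r
library(randomForest)

# Hypothetical data: three predictors and a numeric response `y`.
# Replace with your own training and (truly out-of-sample) test sets.
set.seed(1)
make_df <- function(n) {
  x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
  data.frame(x1, x2, x3, y = 2 * x1 + x2 + rnorm(n))
}
train <- make_df(300)
test  <- make_df(300)

fit <- randomForest(y ~ ., data = train)

# Baseline performance on the untouched test set (MSE here).
mse <- function(pred, obs) mean((pred - obs)^2)
baseline <- mse(predict(fit, test), test$y)

# Permutation importance: shuffle one predictor at a time in the
# test set and record how much the error worsens versus baseline.
predictors <- setdiff(names(test), "y")
importance <- sapply(predictors, function(v) {
  shuffled <- test
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(predict(fit, shuffled), test$y) - baseline
})

barplot(sort(importance, decreasing = TRUE),
        main = "Permutation importance on the test set")
```

One caveat for your time-series setting: a plain shuffle also destroys the serial structure of an autocorrelated feature, so block-wise permutation (shuffling contiguous blocks of observations rather than individual rows) is a variant worth considering.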