I have a cross-sectional data set repeated over two years, 2009 and 2010. I use the first year (2009) as the training set for a Random Forest regression and the second year (2010) as the test set.
Load the data
df <- read.csv("https://www.dropbox.com/s/t4iirnel5kqgv34/df.cv?dl=1")
After training the Random Forest on 2009, the variable importance indicates that x1 is the most important variable.
Random Forest using all variables
library(randomForest)

set.seed(89)
rf2009 <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6,
                       data = df[df$year == 2009, ],
                       ntree = 500,
                       mtry = 6,  # mtry equal to the number of predictors, i.e. bagging
                       importance = TRUE)
print(rf2009)
Call:
randomForest(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6, data = df[df$year == 2009, ], ntree = 500, mtry = 6, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 6
Mean of squared residuals: 5208746
% Var explained: 75.59
Variable importance
imp.all <- as.data.frame(sort(importance(rf2009)[,1],decreasing = TRUE),optional = T)
names(imp.all) <- "% Inc MSE"
imp.all
% Inc MSE
x1 35.857840
x2 16.693059
x3 15.745721
x4 15.105710
x5 9.002924
x6 6.160413
I then apply the model to the test set and obtain the following accuracy metrics.
Prediction and evaluation on the test set
test.pred.all <- predict(rf2009,df[df$year==2010,])
RMSE.forest.all <- sqrt(mean((test.pred.all-df[df$year==2010,]$y)^2))
RMSE.forest.all
[1] 2258.041
MAE.forest.all <- mean(abs(test.pred.all-df[df$year==2010,]$y))
MAE.forest.all
[1] 299.0751
When I then train the model without the variable x1, which was the most important one as per the above, and apply the trained model to the test set, I observe the following:

- the variance explained on the training set is higher with x1 than without x1, as expected;
- but the RMSE on the test data is better without x1 (RMSE: 2258.041 with x1 vs. 1885.462 without x1);
- nevertheless, the MAE is slightly better with x1 (299.0751) than without it (302.3382).
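That RMSE and MAE can rank the two models differently is itself informative: RMSE squares the errors, so a handful of large errors dominate it, while MAE weights all errors equally. A minimal base-R sketch with made-up error vectors (not taken from the data above) shows how the two metrics can flip:

```r
rmse <- function(e) sqrt(mean(e^2))
mae  <- function(e) mean(abs(e))

# model A: mostly small errors plus one large outlier
errA <- c(rep(100, 99), 10000)
# model B: uniformly moderate errors everywhere
errB <- rep(350, 100)

rmse(errA); rmse(errB)  # A ~1004.9 vs. B 350: A looks worse on RMSE
mae(errA); mae(errB)    # A 199    vs. B 350: A looks better on MAE
```

By analogy, the with-x1 model resembles A: smaller typical errors (better MAE) but a few large ones that inflate the RMSE. Inspecting the distribution of test residuals for both models would confirm whether that is what is happening here.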
Random Forest excluding x1
rf2009nox1 <- randomForest(y ~ x2 + x3 + x4 + x5 + x6,
                           data = df[df$year == 2009, ],
                           ntree = 500,
                           mtry = 5,
                           importance = TRUE)
print(rf2009nox1)
Call:
randomForest(formula = y ~ x2 + x3 + x4 + x5 + x6, data = df[df$year == 2009, ], ntree = 500, mtry = 5, importance = TRUE)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 5
Mean of squared residuals: 6158161
% Var explained: 71.14
Variable importance
imp.nox1 <- as.data.frame(sort(importance(rf2009nox1)[,1],decreasing = TRUE),optional = T)
names(imp.nox1) <- "% Inc MSE"
imp.nox1
% Inc MSE
x2 37.369704
x4 11.817910
x3 11.559375
x5 5.878555
x6 5.533794
Prediction and evaluation on the test set
test.pred.nox1 <- predict(rf2009nox1,df[df$year==2010,])
RMSE.forest.nox1 <- sqrt(mean((test.pred.nox1-df[df$year==2010,]$y)^2))
RMSE.forest.nox1
[1] 1885.462
MAE.forest.nox1 <- mean(abs(test.pred.nox1-df[df$year==2010,]$y))
MAE.forest.nox1
[1] 302.3382
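One caveat before drawing conclusions: a single fit of a Random Forest is stochastic, so part of the RMSE gap could be run-to-run noise rather than a real effect of dropping x1. A sketch of a stability check (assuming df and the randomForest package as loaded above; the seed values are an arbitrary choice) that retrains both specifications under several seeds:

```r
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

train <- df[df$year == 2009, ]
test  <- df[df$year == 2010, ]
seeds <- c(1, 7, 42, 89, 123)  # arbitrary seeds

res <- sapply(seeds, function(s) {
  set.seed(s)
  m.all  <- randomForest(y ~ x1 + x2 + x3 + x4 + x5 + x6,
                         data = train, ntree = 500, mtry = 6)
  set.seed(s)
  m.nox1 <- randomForest(y ~ x2 + x3 + x4 + x5 + x6,
                         data = train, ntree = 500, mtry = 5)
  c(with.x1    = rmse(predict(m.all,  test), test$y),
    without.x1 = rmse(predict(m.nox1, test), test$y))
})
res            # test RMSE per seed for each specification
rowMeans(res)  # average test RMSE across seeds
```

If "without x1" wins consistently across seeds, the gap is real and not an artifact of one particular forest.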
I am aware that variable importance refers to the training model, not to the test set, but does this mean that x1 should not be included in the model? In short: should I include x1 or not?