Feature selection and prediction accuracy with a regression forest in R

Question

I am working on a regression problem with about 54 input features.

Using OLS linear regression with a single predictor 'X1', I am not able to explain the variation in Y, so I am trying to find additional important features using a regression forest (i.e., random forest regression). 'X1' later turns out to be the most important feature in the forest as well.

My dataset has ~14500 entries. I have separated it into training and test sets in the ratio 9:1.

I have the following questions:

1. When trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?
2. Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?
3. For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and reverse on the test data (shown below). Is this unusual?

forest <- randomForest(fmla, dTraining, ntree=501, importance=T)

mean((dTraining$y - predict(forest, data=dTraining))^2)

0.9371891

rSquared(dTraining$y, dTraining$y - predict(forest, data=dTraining))

0.7431078

mean((dTest$y - predict(forest, newdata=dTest))^2)

0.009771256

rSquared(dTest$y, dTest$y - predict(forest, newdata=dTest))

0.9950448

Are R-squared and MSE good metrics for this problem, or should I look at other metrics to judge whether the model is good?

CPak · Accepted Answer · 2017-08-29T10:59:19

You may also want to ask this on Cross Validated.

when trying to find the important features, should I run the regression forest on the entire dataset, or only on the training data?

Only on the training data. The point of the train-test split is to detect overfitting: the test set gives you an honest estimate only if it plays no part in building the model, and that includes choosing the features.
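A minimal sketch of ranking features from the training rows only. The data here is synthetic (the real dataset and fmla are not shown in the question), so the column names and coefficients are assumptions made for illustration; randomForest() and importance() are the package's own functions.

```r
library(randomForest)

set.seed(42)

# Synthetic stand-in for the real data: y depends mostly on X1, a little on X2.
n <- 500
d <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n), X4 = rnorm(n))
d$y <- 3 * d$X1 + 1 * d$X2 + rnorm(n, sd = 0.5)

# 9:1 train/test split, as in the question.
idx <- sample(n, size = round(0.9 * n))
dTraining <- d[idx, ]
dTest <- d[-idx, ]

# Fit the forest and rank the features using the training rows only.
forest <- randomForest(y ~ ., data = dTraining, ntree = 501, importance = TRUE)

# type = 1 selects the permutation importance (%IncMSE).
imp <- importance(forest, type = 1)
ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
print(ranked)
```

The test rows in dTest are never passed to randomForest() or importance(), so any later evaluation on them is not influenced by the ranking.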

Once the important features are found, should the model be re-built using the top few features to see whether feature selection speeds up the computation at a small cost to predictive power?

Yes, but the purpose of feature selection is not necessarily to speed up computation. With enough features, a model can fit almost any pattern in the data, including noise (i.e., overfitting). With feature selection, you're hoping to prevent overfitting by keeping only a few 'robust' features.
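One way to check this is to refit on the top-ranked features and compare test error against the full model. The sketch below uses synthetic data and keeps the top 2 features; both the data and the cutoff are assumptions for illustration, not values from the question.

```r
library(randomForest)

set.seed(1)

# Synthetic data: 8 candidate features (X1..X8), only X1 and X2 carry signal.
n <- 500
d <- data.frame(matrix(rnorm(n * 8), ncol = 8))
d$y <- 3 * d$X1 + d$X2 + rnorm(n, sd = 0.5)

idx <- sample(n, size = round(0.9 * n))
dTraining <- d[idx, ]
dTest <- d[-idx, ]

mse <- function(obs, pred) mean((obs - pred)^2)

# Full model: all features, importance computed on the training rows.
full <- randomForest(y ~ ., data = dTraining, ntree = 301, importance = TRUE)
imp <- importance(full, type = 1)
top <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:2]

# Reduced model: refit using only the top-ranked features.
reduced <- randomForest(reformulate(top, response = "y"),
                        data = dTraining, ntree = 301)

# Compare both models on the held-out rows.
mse_full <- mse(dTest$y, predict(full, newdata = dTest))
mse_reduced <- mse(dTest$y, predict(reduced, newdata = dTest))
cat("top features:", top, "\n")
cat("test MSE, full model:   ", mse_full, "\n")
cat("test MSE, reduced model:", mse_reduced, "\n")
```

Whether the reduced model is better, worse, or about the same depends on the data, so the comparison is worth running rather than assuming.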

For now, I have built the model using the training set and all the features, and I am using it for prediction on the test set. I am calculating the MSE and R-squared from the training set. I am getting high MSE and low R2 on the training data, and reverse on the test data (shown below). Is this unusual?

Yes, it's unusual. You want low MSE and high R2 on both your training and test data. (I would double-check your calculations.) High MSE and low R2 on the training data would mean the model fits the very data it was trained on poorly, which is very surprising. Also, I haven't used rSquared, so check which arguments it expects; maybe you want rSquared(dTest$y, predict(forest, newdata=dTest))?
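When double-checking, it helps to compute both metrics the same way for every set. One detail worth verifying: predict() on a randomForest object returns out-of-bag predictions when no newdata argument is supplied, and the training-set calls in the question pass data= rather than newdata=, so those figures may be out-of-bag estimates rather than in-sample ones. The sketch below uses synthetic data, and the helper names mse and r_squared are made up for this example (they are not the rSquared function from the question).

```r
library(randomForest)

# Helpers that take observed values and predictions (not residuals).
mse <- function(obs, pred) mean((obs - pred)^2)
r_squared <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

set.seed(7)

# Synthetic stand-in for the real data.
n <- 500
d <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n))
d$y <- 3 * d$X1 + d$X2 + rnorm(n, sd = 0.5)
idx <- sample(n, size = round(0.9 * n))
dTraining <- d[idx, ]
dTest <- d[-idx, ]

forest <- randomForest(y ~ ., data = dTraining, ntree = 501)

pred_oob <- predict(forest)                         # no newdata: out-of-bag predictions
pred_train <- predict(forest, newdata = dTraining)  # in-sample predictions
pred_test <- predict(forest, newdata = dTest)       # held-out predictions

results <- data.frame(
  set = c("train (OOB)", "train (in-sample)", "test"),
  mse = c(mse(dTraining$y, pred_oob),
          mse(dTraining$y, pred_train),
          mse(dTest$y, pred_test)),
  r2 = c(r_squared(dTraining$y, pred_oob),
         r_squared(dTraining$y, pred_train),
         r_squared(dTest$y, pred_test))
)
print(results)
```

On data like this, the out-of-bag and test rows should be broadly comparable, while the in-sample row looks noticeably better. A test MSE that is far lower than the training MSE, as in the question, is a sign that the two sets differ in some way or that one of the calculations is off.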
