
I have a question regarding rpart and overfitting. My goal is only to do well on prediction. My dataset is large, almost 20,000 points. Using around 2.5% of these points for training I get a prediction error of around 50%, but using 97.5% of the data for training I get around 30%. Since I am using so much data for training, I guess there is a risk of overfitting.

I run this 1000 times with random training/test splits, pruning the tree each time (which is some sort of cross-validation, if I have understood it correctly), and I get fairly stable results (roughly the same prediction error and variable importance).

Can overfitting still be a problem, even though I have run this 1000 times and the prediction error is stable?

I also have a question regarding correlation between my explanatory variables. Can that be a problem in CART (as it is in regression)? In regression I would maybe use the Lasso to handle the correlation. How can I handle correlation with my classification tree?

When I plot the cp table (plotcp) I get this graph:

[plotcp output]

Here is the code I am running (I have repeated this 1000 times with different random splits each time).

library(rpart)

set.seed(1) # For reproducibility
train_frac = 0.975
n = dim(beijing_data)[1]

# Split into training and testing data
ii = sample(seq(1, n), n*train_frac)
data_train = beijing_data[ii,]
data_test = beijing_data[-ii,]

# Grow a full tree (no minimum split size, no complexity penalty)
fit = rpart(as.factor(PM_Dongsi_levels)~DEWP+HUMI+PRES+TEMP+Iws+
              precipitation+Iprec+wind_dir+tod+pom+weekend+month+
              season+year+day,
            data = data_train, minsplit = 0, cp = 0)

plotcp(fit)

# Prune at the CP value with the minimum cross-validated error (xerror)
cp_fit = fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
pfit = prune(fit, cp = cp_fit)
pp <- predict(pfit, newdata = data_test, type = "class")

err = sum(data_test[,"PM_Dongsi_levels"] != pp)/length(pp)
print(err)
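
In sketch form, the 1000 repetitions look roughly like this; run_once() here is just an illustrative helper, not part of the original script.

run_once <- function(train_frac = 0.975) {
  ii <- sample(seq_len(nrow(beijing_data)), nrow(beijing_data) * train_frac)
  data_train <- beijing_data[ii,]
  data_test <- beijing_data[-ii,]
  fit <- rpart(as.factor(PM_Dongsi_levels)~DEWP+HUMI+PRES+TEMP+Iws+
                 precipitation+Iprec+wind_dir+tod+pom+weekend+month+
                 season+year+day,
               data = data_train, minsplit = 0, cp = 0)
  cp_fit <- fit$cptable[which.min(fit$cptable[,"xerror"]), "CP"]
  pfit <- prune(fit, cp = cp_fit)
  pp <- predict(pfit, newdata = data_test, type = "class")
  mean(data_test[,"PM_Dongsi_levels"] != pp)   # misclassification error for this split
}

errs <- replicate(1000, run_once())
summary(errs)   # distribution of the test error over the 1000 random splits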

Link to beijing_data (as an RData file so you can reproduce my example): https://www.dropbox.com/s/6t3lcj7f7bqfjnt/beijing_data.RData?dl=0


1 Answer


The question is quite complex and very hard to answer comprehensively. I will try to provide some insights and references for further reading.

  • Correlated features do not pose as severe a problem for tree-based methods as they do for models that use a hyperplane as the classification boundary. When there are multiple correlated features, the tree will simply pick one and ignore the rest. However, correlated features often cloud the interpretability of such a model, mask interactions, and so on. Tree-based models can also benefit from the removal of such variables, since there is then a smaller space to search (a short sketch of this behaviour follows after this list). Here is a decent resource on trees. Also check these videos 1, 2 and 3 and the ISLR book.

  • Models based on a single tree tend not to perform as well as hyperplane-based methods. So if you are mainly interested in the quality of prediction, you should explore models based on ensembles of trees, such as bagging and boosting models. Popular implementations of bagging and boosting in R are randomForest and xgboost; both can be used with little to no experience and can give good predictions (a minimal randomForest sketch also follows this list). Here is a resource on how to use the popular R machine learning library caret to tune a random forest. Another resource is the R mlr library, which provides great wrappers for many ML-related tasks; for instance, here is a short blog post on model-based optimization of xgboost.

  • The resampling strategy for model validation varies with the task and the available data. With 20k rows I would probably use 50-60% for training, 20% for validation and 20-30% as a test set. The training set I would use to select a suitable ML method, features, hyperparameters and so on by repeated K-fold cross-validation (2-3 repeats of 4-5-fold, or similar). The 20% validation set I would use to fine-tune things and to get a feel for how well the cross-validation results on the training set generalize. When I am satisfied with everything I would use the test set as a final check that I have a good model. Here are some resources on resampling: 1, 2, 3 and nested resampling.
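
To illustrate the first point, here is a small self-contained sketch (toy data, not your dataset): with two nearly collinear predictors the fitted tree typically splits on only one of them, while variable importance still credits both through surrogate splits.

library(rpart)
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)    # x2 is almost a copy of x1
y  <- factor(ifelse(x1 + rnorm(n) > 0, "high", "low"))
toy_fit <- rpart(y ~ x1 + x2, method = "class")
printcp(toy_fit)                  # "Variables actually used in tree construction" is usually just one of them
toy_fit$variable.importance       # both variables show up here via surrogate splits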
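
For the second point, a minimal randomForest sketch on your data, assuming the same data_train/data_test split and formula as in your question (note that randomForest does not handle missing values, so you may need to impute or drop NAs first):

library(randomForest)
set.seed(1)
rf_fit <- randomForest(as.factor(PM_Dongsi_levels)~DEWP+HUMI+PRES+TEMP+Iws+
                         precipitation+Iprec+wind_dir+tod+pom+weekend+month+
                         season+year+day,
                       data = data_train, ntree = 500, importance = TRUE)
rf_pred <- predict(rf_fit, newdata = data_test)
mean(rf_pred != data_test[,"PM_Dongsi_levels"])   # test error, comparable to err above
varImpPlot(rf_fit)                                # variable importance from the forest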

In your situation I would use

# Stratified split: 60 % of the rows go to the training set
z <- caret::createDataPartition(data$y, p = 0.6, list = FALSE)
train <- data[z,]
test <- data[-z,]

to split the data into train and test sets. I would then repeat the process on the test set with p = 0.5 to get the validation and final test sets, as sketched below.
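
A sketch of that second split, keeping the same placeholder column y (the zz and validation names are just for illustration):

# Half of the held-out 40 % becomes the validation set, half the final test set
zz <- caret::createDataPartition(test$y, p = 0.5, list = FALSE)
validation <- test[zz,]
test <- test[-zz,]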

On the training data I would use this tutorial on random forests to tune the mtry and ntree parameters (the "Extend Caret" section), using 5-fold repeated cross-validation in caret and a grid search.

# 5-fold cross-validation, repeated 3 times
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# Grid over mtry and ntree (tuning ntree requires the custom caret model from the tutorial)
tunegrid <- expand.grid(.mtry = c(1:15), .ntree = c(200, 500, 700, 1000, 1200, 1500))

and so on, as detailed in the mentioned link.
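
For orientation, a minimal sketch of the corresponding train() call using caret's built-in "rf" method, which tunes mtry only; tuning ntree as well needs the custom model from the linked tutorial (y, train and control are the placeholders defined above):

library(caret)
set.seed(1)
rf_tuned <- train(y ~ ., data = train,
                  method = "rf",
                  metric = "Accuracy",
                  trControl = control,
                  tuneGrid = expand.grid(.mtry = 1:15))
print(rf_tuned)   # cross-validated accuracy for each mtry value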

On a final note, the more data you have to train on, the less likely you are to over-fit.