I have a question regarding rpart and overfitting. My goal is only to do well on prediction. My dataset is large, almost 20,000 points. Using around 2.5% of these points for training I get a prediction error of around 50%, but using 97.5% of the data for training I get around 30%. Since I am using so much data for training, I guess there is a risk of overfitting.
I run this 1000 times with random training/test splits and prune the tree each time, which, if I have understood it correctly, amounts to a form of cross-validation (repeated random sub-sampling, sometimes called Monte Carlo cross-validation). The results are pretty much stable: the same prediction error and the same variable importances each time.
Can overfitting still be a problem, even though I have run this 1000 times and the prediction error is stable?
I also have a question regarding correlation between my explanatory variables. Can that be a problem in CART, as it is in regression? In regression I would maybe use the Lasso to try to deal with the correlation. How can I handle correlated predictors in my classification tree?
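To make this concrete, here is roughly how I look at the pairwise correlations among the predictors (a minimal sketch; I am assuming the variables listed here are the numeric ones, using the names from my rpart call below):

# Pairwise correlations among the (assumed numeric) predictors;
# use = "pairwise.complete.obs" just guards against missing values
num_vars <- c("DEWP", "HUMI", "PRES", "TEMP", "Iws", "precipitation", "Iprec")
round(cor(beijing_data[, num_vars], use = "pairwise.complete.obs"), 2)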
When I plot the complexity parameter table with plotcp I get this graph:
Here is the code I am running (I have repeated this 1000 times with different random splits each time).
library(rpart)

set.seed(1) # For reproducibility
train_frac <- 0.975
n <- nrow(beijing_data)

# Split into training and testing data
ii <- sample(seq_len(n), floor(n * train_frac))
data_train <- beijing_data[ii, ]
data_test <- beijing_data[-ii, ]
# Grow a deliberately overgrown tree (no complexity penalty); pruned below
fit <- rpart(as.factor(PM_Dongsi_levels) ~ DEWP + HUMI + PRES + TEMP + Iws +
               precipitation + Iprec + wind_dir + tod + pom + weekend +
               month + season + year + day,
             data = data_train, method = "class", minsplit = 0, cp = 0)
# Cross-validated error as a function of the complexity parameter
plotcp(fit)

# Pick the cp value with the minimum cross-validated error (xerror) and prune
cp_fit <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit <- prune(fit, cp = cp_fit)
# Misclassification rate on the held-out test data
pp <- predict(pfit, newdata = data_test, type = "class")
err <- mean(data_test[, "PM_Dongsi_levels"] != pp)
print(err)
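For completeness, the 1000 repetitions mentioned above look roughly like this (a minimal sketch; n_reps and err_vec are names I made up for this post, and n and train_frac come from the code above):

n_reps <- 1000
err_vec <- numeric(n_reps)
for (r in seq_len(n_reps)) {
  # A fresh random split on every repetition
  ii <- sample(seq_len(n), floor(n * train_frac))
  data_train <- beijing_data[ii, ]
  data_test <- beijing_data[-ii, ]
  fit <- rpart(as.factor(PM_Dongsi_levels) ~ DEWP + HUMI + PRES + TEMP + Iws +
                 precipitation + Iprec + wind_dir + tod + pom + weekend +
                 month + season + year + day,
               data = data_train, method = "class", minsplit = 0, cp = 0)
  # Prune at the cp with minimum cross-validated error, as above
  cp_fit <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
  pfit <- prune(fit, cp = cp_fit)
  pp <- predict(pfit, newdata = data_test, type = "class")
  err_vec[r] <- mean(data_test[, "PM_Dongsi_levels"] != pp)
}
mean(err_vec)
sd(err_vec)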
Link to beijing_data (as an RData file so you can reproduce my example): https://www.dropbox.com/s/6t3lcj7f7bqfjnt/beijing_data.RData?dl=0