
I would like to calculate a confidence interval for the RMSE of a machine learning regression model on its out-of-sample test set predictions.

My train set is the first 80% of the sample, and the "out-of-sample" test set is the last 20%. I treat the RMSE of the test set predictions as the out-of-sample performance, and would like to calculate a CI for this RMSE.

One idea I had was to resample the train set from the first 80%, but use the same test set in each iteration. This would seem to capture the variability of the test-set RMSE across different possible training sets. However, it would not account for possible variation in the test set itself.
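A rough sketch of that idea, purely for illustration: scikit-learn, a synthetic dataset, and a random-forest regressor stand in for the actual data and model, and the training rows are resampled with replacement (one possible reading of "resample").

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Placeholder data and model; substitute your own.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

# Time-ordered split: first 80% train, last 20% test.
split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

rng = np.random.default_rng(0)
n_boot = 200  # illustrative; use more in practice
rmses = []
for _ in range(n_boot):
    # Resample training rows with replacement; the test set never changes.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    model = RandomForestRegressor(random_state=0).fit(X_train[idx], y_train[idx])
    rmses.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

# Spread of test-set RMSE across training resamples (fixed test set).
lo, hi = np.percentile(rmses, [2.5, 97.5])
print(f"RMSE interval across training resamples: [{lo:.3f}, {hi:.3f}]")
```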

Is this approach sensible? Is there a better way to address my question? Thanks!


1 Answer


Is there a reason you want to fix the test set to be that exact sample of observations?

One approach would be to repeatedly split the dataset into training and test sets at the 80/20 proportion you are currently using, drawing a new random split on each repetition. After each split, proceed as usual: train your model, then calculate the RMSE on that split's test data. You can perform, say, 10,000 such repetitions, save the associated RMSE values, and take the confidence interval from the resulting distribution of RMSEs (for example, the 2.5th and 97.5th percentiles for a 95% interval).
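A sketch of this procedure, again with placeholder data and model and far fewer repetitions than the 10,000 suggested above, just to show the shape of the loop:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Placeholder data and model; substitute your own.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

rmses = []
for seed in range(500):  # e.g. 10,000 in practice
    # New random 80/20 split on every repetition.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
    rmses.append(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))

# Percentile confidence interval over the RMSE distribution.
lo, hi = np.percentile(rmses, [2.5, 97.5])
print(f"95% CI for out-of-sample RMSE: [{lo:.3f}, {hi:.3f}]")
```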

See, e.g., Chapter 5 of Hastie et al.