This is a theoretical question about xgb and gradient boosting in general. How can I find the best balance between max_depth and num_rounds (n_estimators)? A larger max_depth produces more complex trees, which is generally not recommended in boosting, but hundreds of boosting rounds can also overfit the training data. Assuming CV gives me the same mean/std for max_depth 5 with num_rounds 1000 as for max_depth 15 with num_rounds 100, which one should I use when releasing the model for unseen data?
1 Answer
In theory one could derive generalization bounds for these models, but the problem is that they are extremely loose, so a smaller upper bound does not really guarantee better scores. In practice, the best approach is to make your generalization estimate more reliable. Are you using 10-fold CV? Use 10x10 CV (ten random shuffles of 10-fold CV); if that still gives no answer, use 100 shuffles. At some point you will get a winner. Furthermore, if you are actually going to release the model to the public, maybe the expected value is not the best metric? CV usually reports the mean (expected value), so instead of looking only at this, look at the whole spectrum of results obtained. Two configurations with the same mean and different std clearly show what to choose: the one with the lower std. When both means and stds are the same, you can look at the min of the scores (which captures the "worst case" scenario), etc.
To sum up: take a close look at the scores, not just averages - and repeat evaluation multiple times to make this reliable.
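The repeated evaluation described above can be sketched roughly as follows. This is a minimal standard-library illustration, not a definitive implementation: `repeated_kfold_scores`, `summarize`, and the `score_fn` callback are hypothetical names introduced here, and `score_fn` is assumed to train one configuration (say max_depth 5 with 1000 rounds) on the training indices and return its score on the test indices.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence


def repeated_kfold_scores(
    n_samples: int,
    score_fn: Callable[[Sequence[int], Sequence[int]], float],
    n_splits: int = 10,
    n_repeats: int = 10,
    seed: int = 0,
) -> List[float]:
    """Return one score per fold over n_repeats reshuffled k-fold runs."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    scores: List[float] = []
    for _ in range(n_repeats):
        rng.shuffle(indices)  # a fresh random partition for every repeat
        for k in range(n_splits):
            test_idx = indices[k::n_splits]  # every n_splits-th index forms fold k
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            # score_fn is a hypothetical callback: fit on train_idx, score on test_idx
            scores.append(score_fn(train_idx, test_idx))
    return scores


def summarize(scores: Sequence[float]) -> Dict[str, float]:
    """Look beyond the average: spread and worst case matter too."""
    return {
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),
        "min": min(scores),  # the "worst case" fold
        "n": len(scores),
    }
```

Running this once per configuration with the same seed gives both candidates identical splits, so their mean, std, and min can be compared directly. With scikit-learn available, `RepeatedKFold` passed as `cv` to `cross_val_score` should yield the same kind of per-fold score list.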