0 votes

I was trying to understand sklearn's GridSearchCV. I have a few basic questions about the use of cross-validation in GridSearchCV, and about how I should use GridSearchCV's recommendations afterwards.

Say I declare a GridSearchCV instance as below:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search is deprecated

RFReg = RandomForestRegressor(random_state=1)

param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [4, 5, 6, 7, 8, 9, 10]
}

CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10)
CV_rfc.fit(X_train, y_train)

I have the following questions:

  1. Say in the first iteration n_estimators = 100 and max_depth = 4 are selected for model building. Will the score for this model now be chosen with the help of 10-fold cross-validation?

    • a. My understanding of the process is as follows:

      1. X_train and y_train will be split into 10 sets.
      2. The model will be trained on 9 sets and tested on the 1 remaining set, and its score will be stored in a list: say score_list.
      3. This process will be repeated 9 more times, and each of these 9 scores will be added to score_list to give 10 scores in all.
      4. Finally, the average of score_list will be taken to give a final_score for the model with parameters n_estimators = 100 and max_depth = 4.
    • b. The above process will be repeated with all other possible combinations of n_estimators and max_depth, and each time we will get a final_score for that model.

    • c. The best model will be the model having the highest final_score, and we will get the corresponding best values of 'n_estimators' and 'max_depth' via CV_rfc.best_params_

Is my understanding of GridSearchCV correct?
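To make sure I have it right, here is a minimal sketch of what I believe happens for a single cell of the grid; make_regression is just stand-in data for my X_train, y_train:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Stand-in data only; in my case this would be X_train, y_train
X_train, y_train = make_regression(n_samples=200, n_features=5, random_state=1)

# One cell of the grid: n_estimators = 100, max_depth = 4
model = RandomForestRegressor(n_estimators=100, max_depth=4, random_state=1)

# 10-fold CV: train on 9 folds, score on the held-out fold, repeat 10 times
score_list = cross_val_score(model, X_train, y_train, cv=10)

# The final_score for this parameter combination is the mean of the 10 scores
final_score = np.mean(score_list)
print(score_list, final_score)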

  2. Now say I get the best model parameters as {'max_depth': 10, 'n_estimators': 100}. I declare an instance of the model as below:

RFReg_best = RandomForestRegressor(n_estimators = 100, max_depth = 10, random_state = 1)

I now have two options, and I want to know which of them is correct:

a. Use cross-validation on the entire dataset to see how well the model is performing, as below:

from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(RFReg_best, X, y, cv=10, scoring='neg_mean_squared_error')
rm_score = np.sqrt(-scores)

b. Fit the model on X_train, y_train and then test it on X_test, y_test:

from sklearn.metrics import mean_squared_error

RFReg_best.fit(X_train, y_train)
y_pred = RFReg_best.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))

Or are both of them correct?

1. Yes, your understanding is correct. 2. I would lean more towards b, though you're not really doing more tests with the model. Definitely not a, because there's no point in doing CV after you've already found the best model. Train the best model on the entire training set and predict on X_test. – Scratch'N'Purr
I have an answer (and other answers in that answer) which explains this here. – Vivek Kumar
@VivekKumar Your explanation is indeed great. Thanks!! – Rookie_123
@Scratch'N'Purr But if I go with 2b, won't the best model be trained only on X_train and y_train? Won't the model coefficients be biased towards X_train? Forgive me if my understanding is wrong; I'm new to the field of machine learning. – Rookie_123
@Scratch'N'Purr Note that I am considering X, y in option 2(a) rather than X_train, y_train. – Rookie_123

1 Answer

2 votes

Regarding (1), your understanding is indeed correct; a wording detail worth correcting in principle is "better final_score" instead of "higher", since there are several performance metrics (everything that measures error, such as MSE, MAE etc.) which are the-lower-the-better.
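As a side note, scikit-learn handles this by negating such error metrics, so that GridSearchCV can always pick the highest score internally; here is a minimal sketch, with make_regression standing in for your data and a reduced grid for brevity:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_regression(n_samples=200, n_features=5, random_state=1)  # stand-in data

param_grid = {'n_estimators': [100, 500], 'max_depth': [4, 8]}  # reduced grid for brevity
CV_rfc = GridSearchCV(estimator=RandomForestRegressor(random_state=1),
                      param_grid=param_grid,
                      cv=10,
                      scoring='neg_mean_squared_error')
CV_rfc.fit(X_train, y_train)

print(CV_rfc.best_params_)   # the combination with the lowest MSE
print(-CV_rfc.best_score_)   # negate to get back a positive MSE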

Now, step (2) is more tricky; it requires taking a step back to check the whole procedure...

To start with, in general CV is used either for parameter tuning (your step 1) or for model assessment (i.e. what you are trying to do in step 2), which are different things indeed. Splitting from the very beginning your data into training & test sets as you have done here, and then sequentially performing the steps 1 (for parameter tuning) and 2b (model assessment in unseen data) is arguably the most "correct" procedure in principle (as for the bias you note in the comment, this is something we have to live with, since by default all our fitted models are "biased" toward the data used for their training, and this cannot be avoided).
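A rough sketch of that end-to-end procedure (with make_regression only as a stand-in for your data; note that GridSearchCV by default refits the best parameter combination on the whole training set):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=500, n_features=5, random_state=1)  # stand-in data

# Hold out a test set before any tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Step 1: parameter tuning with CV on the training set only
param_grid = {'n_estimators': [100, 500], 'max_depth': [4, 8]}  # reduced grid for brevity
CV_rfc = GridSearchCV(estimator=RandomForestRegressor(random_state=1),
                      param_grid=param_grid, cv=10,
                      scoring='neg_mean_squared_error')
CV_rfc.fit(X_train, y_train)

# Step 2b: model assessment on the untouched test set
y_pred = CV_rfc.best_estimator_.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))
print(CV_rfc.best_params_, rm_score)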

Nevertheless, since early on, practitioners have been wondering whether they can avoid "sacrificing" a part of their precious data for testing (model assessment) purposes only, and whether they can actually skip the model assessment part (and the test set itself) altogether, using the best results obtained from the parameter tuning procedure (your step 1) as the model assessment. This is clearly cutting corners, but, as usual, the question is: how far off will the actual results be, and will they still be meaningful?
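A minimal sketch of that corner-cutting variant (again with stand-in data and a reduced grid) would look something like this, with the best CV score itself reported as the, somewhat optimistic, performance estimate:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=500, n_features=5, random_state=1)  # stand-in data

param_grid = {'n_estimators': [100, 500], 'max_depth': [4, 8]}  # reduced grid for brevity
CV_all = GridSearchCV(estimator=RandomForestRegressor(random_state=1),
                      param_grid=param_grid, cv=10,
                      scoring='neg_mean_squared_error')
CV_all.fit(X, y)   # tuning on the whole dataset, no separate test set

# RMSE taken straight from the tuning CV; expect it to be somewhat optimistic
rm_score = np.sqrt(-CV_all.best_score_)
print(CV_all.best_params_, rm_score)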

Again, in theory, what Vivek Kumar writes in his linked answer is correct:

If you use the whole data into GridSearchCV, then there would be leakage of test data into parameter tuning and then the final model may not perform that well on newer unseen data.

But here is a relevant excerpt of the (highly recommended) Applied Predictive Modeling book (p. 78):

[image: excerpt from Applied Predictive Modeling, p. 78]

In short: if you use the whole X in step 1 and consider the results of the tuning as model assessment, there will indeed be a bias/leakage, but it is usually small, at least for moderately large training sets...


Wrapping-up:

  • The "most correct" procedure in theory is indeed the combination of your steps 1 and 2b
  • You can try to cut corners by using the whole dataset X in step 1, and most probably you will still be within acceptable limits regarding your model assessment.