I was trying to understand sklearn's GridSearchCV. I had a few basic questions about the use of cross-validation in GridSearchCV, and about how I should use GridSearchCV's recommendations afterwards.
Say I declare a GridSearchCV instance as below:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in 0.20

RFReg = RandomForestRegressor(random_state = 1)
param_grid = {
'n_estimators': [100, 500, 1000, 1500],
'max_depth' : [4,5,6,7,8,9,10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv= 10)
CV_rfc.fit(X_train, y_train)
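For reference, a minimal runnable sketch of the search above (synthetic data and a smaller grid stand in for the question's X_train, y_train so it runs quickly):

```python
# Minimal sketch of fitting and inspecting a GridSearchCV.
# Data is synthetic and the grid is deliberately small; substitute your own
# X_train, y_train and the full grid from the question.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_regression(n_samples=100, n_features=5, random_state=1)

RFReg = RandomForestRegressor(random_state=1)
param_grid = {'n_estimators': [10, 50], 'max_depth': [3, 5]}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=3)
CV_rfc.fit(X_train, y_train)

print(CV_rfc.best_params_)   # best parameter combination found
print(CV_rfc.best_score_)    # its mean cross-validated score
```

`CV_rfc.cv_results_` additionally exposes the per-fold and mean scores for every parameter combination tried.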
I had the below questions:

1. Say in the first iteration n_estimators = 100 and max_depth = 4 is selected for model building. Will the score for this model be chosen with the help of 10-fold cross-validation?

a. My understanding of the process is as follows:

  1. X_train and y_train will be split into 10 sets.
  2. The model will be trained on 9 sets and tested on the 1 remaining set, and its score will be stored in a list, say score_list.
  3. This process will be repeated 9 more times, and each of these 9 scores will be added to score_list, giving 10 scores in all.
  4. Finally, the average of score_list will be taken to give a final_score for the model with parameters n_estimators = 100 and max_depth = 4.

b. The above process will be repeated with all other possible combinations of n_estimators and max_depth, and each time we will get a final_score for that model.

c. The best model will be the one with the highest final_score, and we will get the corresponding best values of n_estimators and max_depth via CV_rfc.best_params_.

Is my understanding of GridSearchCV correct?
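The per-parameter-combination process described above can be sketched by hand (synthetic data; names like score_list and final_score follow the question). When cv is given as an integer and the estimator is a regressor, GridSearchCV uses an unshuffled KFold internally, so this mirrors what it does for one combination:

```python
# Sketch of the 10-fold scoring for a single parameter combination:
# one score per held-out fold, then the average as the combination's final score.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X_train, y_train = make_regression(n_samples=100, n_features=5, random_state=1)
model = RandomForestRegressor(n_estimators=10, max_depth=4, random_state=1)

cv = KFold(n_splits=10)
score_list = cross_val_score(model, X_train, y_train, cv=cv)  # 10 fold scores
final_score = score_list.mean()                               # the combination's score
print(len(score_list), final_score)
```

GridSearchCV repeats exactly this for every combination in the grid and keeps the one with the highest mean score.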
2. Now say I get the best model parameters as {'max_depth': 10, 'n_estimators': 100}. I declare an instance of the model as below:
RFReg_best = RandomForestRegressor(n_estimators = 100, max_depth = 10, random_state = 1)
I now have two options, and I wanted to know which of them is correct:

a. Use cross-validation on the entire dataset to see how well the model performs, as below:
scores = cross_val_score(RFReg_best, X, y, cv = 10, scoring = 'neg_mean_squared_error')
rm_score = -scores
rm_score = np.sqrt(rm_score)
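Option (a) end to end, as a runnable sketch (synthetic data; note that the current scorer name is 'neg_mean_squared_error', which returns negated MSE — hence the sign flip before taking the square root):

```python
# Sketch of option (a): cross-validated RMSE over the whole dataset.
# X, y are synthetic stand-ins for the question's data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=5, random_state=1)
RFReg_best = RandomForestRegressor(n_estimators=10, max_depth=10, random_state=1)

scores = cross_val_score(RFReg_best, X, y, cv=10,
                         scoring='neg_mean_squared_error')  # negated MSE per fold
rm_score = np.sqrt(-scores)                                 # per-fold RMSE
print(rm_score.mean())
```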
b. Fit the model on X_train, y_train and then test it on X_test, y_test:
RFReg_best.fit(X_train, y_train)
y_pred = RFReg_best.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))
Or are both of them correct?
b, though you're not really doing more tests with the model. Definitely not a, because there's no point in doing CV after you have already found the best model. Train the best model on the entire training set, and predict on X_test. – Scratch'N'Purr
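Worth noting here: with the default refit=True, GridSearchCV already refits the winning parameter combination on all of the training data, so re-declaring RFReg_best by hand is unnecessary. A sketch (synthetic data standing in for the question's split):

```python
# Sketch: use the refitted best estimator from the search directly on X_test.
# With refit=True (the default), CV_rfc.best_estimator_ is already trained
# on all of X_train with the best parameters found.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=120, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

CV_rfc = GridSearchCV(RandomForestRegressor(random_state=1),
                      {'n_estimators': [10, 50], 'max_depth': [3, 5]}, cv=3)
CV_rfc.fit(X_train, y_train)

y_pred = CV_rfc.best_estimator_.predict(X_test)  # same as CV_rfc.predict(X_test)
rm_score = np.sqrt(mean_squared_error(y_test, y_pred))
print(rm_score)
```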
But then won't the best model be trained only on X_train and y_train? Won't the model coefficients be biased towards X_train? Forgive me if my understanding is wrong; I'm new to the field of machine learning. – Rookie_123
So should I split X, y in two rather than using X_train, y_train? – Rookie_123