Fitting sklearn GridSearchCV model

Question

I am trying to solve a regression problem on the Boston dataset with a random forest regressor. I am using GridSearchCV to select the best hyperparameters.

Problem 1

Should I fit the GridSearchCV on some X_train, y_train and then get the best parameters?

OR

Should I fit it on X, y to get the best parameters? (X, y = entire dataset)

Problem 2

Say I fit it on X, y, get the best parameters, and then build a new model with those parameters. What data should I train this new model on?

Should I train the new model on X_train, y_train or on X, y?

Problem 3

If I train the new model on X, y, how will I validate the results?

My code so far

# Dataframes
feature_cols = ['CRIM', 'ZN', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']

X = boston_data[feature_cols]
y = boston_data['PRICE']

Train Test Split of Data

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)

Grid Search to get best hyperparameters

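# sklearn.grid_search was removed in scikit-learn 0.20; use model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Base estimator for the search (its definition is not shown in the original; assumed here)
RFReg = RandomForestRegressor(random_state=1)

param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [4, 5, 6, 7, 8, 9, 10]
}

# 10-fold cross-validation over every combination in param_grid
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10)
CV_rfc.fit(X_train, y_train)

CV_rfc.best_params_
#{'max_depth': 10, 'n_estimators': 100}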

Train a model with max_depth = 10, n_estimators = 100

RFReg = RandomForestRegressor(max_depth = 10, n_estimators = 100, random_state = 1)
RFReg.fit(X_train, y_train)
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)

RMSE: 2.8139766730629394
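
The step that produced the RMSE is not shown above. A minimal sketch of one way to compute it follows, assuming it was taken between y_test and y_pred. The synthetic data from make_regression is only a stand-in so the snippet runs on its own; it is not the Boston data, so the numbers will differ from the value reported above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Boston data (assumption: 11 numeric features).
X, y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

RFReg = RandomForestRegressor(max_depth=10, n_estimators=100, random_state=1)
RFReg.fit(X_train, y_train)
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)

# RMSE is the square root of the mean squared error.
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))
print(f"train RMSE: {rmse_train:.3f}, test RMSE: {rmse_test:.3f}")
```

Comparing the two values is a quick check for overfitting: a train RMSE far below the test RMSE suggests the forest is memorising the training rows.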

I just want some guidance on what the correct steps would be.

This is a question about methodology, and not programming, hence more appropriate for Cross Validated (and arguably off-topic here). — desertnaut

FMarazzi · Accepted Answer · 2018-11-23T15:31:51

In general, to tune the hyperparameters, you should always fit your model on X_train and use X_test to check the results. With GridSearchCV, the candidate parameters are compared by cross-validation inside X_train, so X_test is only needed to check the model that comes out of the search.
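
One way to arrange this with GridSearchCV is sketched below. It is a sketch under assumptions, not the only correct setup: the data is synthetic (make_regression stands in for the Boston data), and the grid is deliberately small so it runs quickly. With the default refit=True, the search refits the best combination on all of X_train, so best_estimator_ can be scored on X_test directly without building a new model by hand.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Boston data (assumption: 11 numeric features).
X, y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Small illustrative grid; the values are assumptions chosen for speed.
param_grid = {"n_estimators": [20, 50], "max_depth": [4, 8]}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid=param_grid,
    cv=3,
    scoring="neg_mean_squared_error",
)
# Cross-validation happens inside the training split only.
search.fit(X_train, y_train)

# refit=True (the default) has already retrained the best model on X_train.
best_model = search.best_estimator_
test_r2 = best_model.score(X_test, y_test)  # R^2 on the untouched test split
print(search.best_params_, test_r2)
```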

You should never tune hyperparameters over the whole dataset, because that would defeat the purpose of the train/test split (as you correctly point out in Problem 3).
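
If holding out a separate test set costs too much data, one commonly used alternative is nested cross-validation: the grid search runs inside each outer training fold, and each outer test fold is used only for scoring. The sketch below assumes synthetic data and a small illustrative grid; it is not part of the original answer.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in for the Boston data (assumption: 11 numeric features).
X, y = make_regression(n_samples=150, n_features=11, noise=10.0, random_state=1)

# Inner loop: hyperparameter search, run separately on each outer training fold.
inner_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=1),
    param_grid={"n_estimators": [20, 50], "max_depth": [4, 8]},
    cv=3,
)

# Outer loop: each fold scores a model tuned without seeing that fold.
# With no scoring argument, the score is the regressor's default, R^2.
outer_scores = cross_val_score(inner_search, X, y, cv=3)
print(outer_scores.mean(), outer_scores.std())
```

The outer scores estimate how well the whole procedure (search plus refit) generalises, rather than the quality of one particular parameter setting.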
