What is the meaning of the GridSearchCV best_score_ attribute? (the value is different from the mean of the cross validation array)

Question

I'm confused by the results; I'm probably not understanding cross-validation and GridSearchCV correctly. I followed the logic in this post: https://randomforests.wordpress.com/2014/02/02/basics-of-k-fold-cross-validation-and-gridsearchcv-in-scikit-learn/

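# Imports for the pre-0.18 scikit-learn module layout this code targets
import numpy as np
import pandas as pd
from sklearn.cross_validation import KFold
from sklearn.grid_search import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

argd = CommandLineParser(argv)
folder,fname=argd['dir'],argd['fname']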

df = pd.read_csv('../../'+folder+'/Results/'+fname, sep=";")

explanatory_variable_columns = set(df.columns.values)
response_variable_column = df['A']
explanatory_variable_columns.remove('A')
y = np.array([1 if e else 0 for e in response_variable_column])

X =df[list(explanatory_variable_columns)].as_matrix()

kf_total = KFold(len(X), n_folds=5, indices=True, shuffle=True, random_state=4)

dt=DecisionTreeClassifier(criterion='entropy')

min_samples_split_range=[x for x in range(1,20)]
dtgs=GridSearchCV(estimator=dt, param_grid=dict(min_samples_split=min_samples_split_range), n_jobs=1)

scores=[dtgs.fit(X[train],y[train]).score(X[test],y[test]) for train, test in kf_total]
# SAME AS DOING: cross_validation.cross_val_score(dtgs, X, y, cv=kf_total, n_jobs = 1)

print scores
print np.mean(scores)
print dtgs.best_score_

RESULTS OBTAINED:

# score [0.81818181818181823, 0.78181818181818186, 0.7592592592592593, 0.7592592592592593, 0.72222222222222221]
# mean score 0.768
# .best_score_ 0.683486238532

ADDITIONAL NOTE:

I ran it with a different subset of the explanatory variables and got the opposite problem: now .best_score_ is higher than every value in the cross-validation array.

# score [0.74545454545454548, 0.70909090909090911, 0.79629629629629628, 0.7407407407407407, 0.64814814814814814]
# mean score 0.728
# .best_score_ 0.802752293578

hellpanderr · Accepted Answer · 2015-09-17T15:47:40

The code conflates several things. dtgs.fit(X[train_],y[train_]) runs an internal 3-fold cross-validation (the GridSearchCV default in this version) for every parameter value in param_grid, producing a grid of 19 mean scores, one per value in range(1,20), which you can inspect through the dtgs.grid_scores_ attribute.
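To make the "grid of results" concrete, here is a minimal sketch, not the asker's actual setup. It assumes a recent scikit-learn, where the sklearn.model_selection module and the cv_results_ attribute take the place of sklearn.grid_search and grid_scores_. The data is a synthetic stand-in from make_classification, the grid starts at 2 because recent versions reject min_samples_split=1, and cv=3 is passed explicitly because the default is no longer 3.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the asker's data (assumption, not the original CSV).
X, y = make_classification(n_samples=270, n_features=8, random_state=0)

# Recent scikit-learn requires min_samples_split >= 2, so the grid starts at 2.
param_grid = {"min_samples_split": list(range(2, 20))}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid=param_grid,
    cv=3,  # set explicitly to mirror the 3-fold inner validation described above
)
search.fit(X, y)

# One mean validation score per candidate parameter value.
mean_scores = search.cv_results_["mean_test_score"]
for params, score in zip(search.cv_results_["params"], mean_scores):
    print(params, round(float(score), 3))

# best_score_ is the highest of these per-candidate mean scores.
print(search.best_score_ == mean_scores.max())
```

Each printed row is the average over the three inner folds for one candidate, and best_score_ is simply the largest of those averages.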

The line [dtgs.fit(X[train_],y[train_]).score(X[test],y[test]) for train_, test in kf_total] therefore fits the grid search five times, once per training fold of kf_total, and scores each refitted best estimator on the corresponding held-out test fold. The result is the array of five scores from the outer 5-fold cross-validation.

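When you then call dtgs.best_score_, you get the best mean score from the grid of 3-fold validation results for the last of those five fits only. It is computed on different splits of the data than the outer scores, so it can be lower or higher than any of them.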
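The difference between the two numbers can be seen by recording best_score_ after every outer fit. This is only a sketch under assumptions: it uses the recent sklearn.model_selection API rather than the older one in the question, synthetic data from make_classification instead of the asker's file, and a grid starting at 2 because recent versions reject min_samples_split=1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the asker's data (assumption).
X, y = make_classification(n_samples=270, n_features=8, random_state=0)

outer_cv = KFold(n_splits=5, shuffle=True, random_state=4)
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid={"min_samples_split": list(range(2, 20))},
    cv=3,  # inner 3-fold validation, as in the answer
)

outer_scores = []  # accuracy on each held-out outer test fold
inner_best = []    # best_score_ captured right after each fit

for train_idx, test_idx in outer_cv.split(X):
    search.fit(X[train_idx], y[train_idx])
    inner_best.append(search.best_score_)
    outer_scores.append(search.score(X[test_idx], y[test_idx]))

print("outer scores:", np.round(outer_scores, 3))
print("mean outer score:", round(float(np.mean(outer_scores)), 3))
print("best_score_ per fit:", np.round(inner_best, 3))

# After the loop, the attribute only reflects the final fit.
print(search.best_score_ == inner_best[-1])
```

The outer scores estimate how well the whole tuning procedure generalizes, while each best_score_ is an inner validation score used only to pick min_samples_split, so there is no reason for the two to match.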
