I am using sklearn.model_selection.GridSearchCV and sklearn.model_selection.cross_val_score, and while doing so I faced an unexpected result.
In my example I use the following imports:
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV
import numpy as np
First, I create a random data set:
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
Next, I define a pipeline "generator":
def my_pipeline(C=None):
    if C is None:
        return Pipeline(
            [
                ('step1', StandardScaler()),
                ('clf', LinearSVC(random_state=42))
            ])
    else:
        return Pipeline(
            [
                ('step1', StandardScaler()),
                ('clf', LinearSVC(C=C, random_state=42))
            ])
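For reference, the two branches can be collapsed by setting the nested parameter through the pipeline, which uses the same `clf__C` name that the grid uses later. This is only a sketch; `build_pipeline` is a hypothetical name that does not appear elsewhere in this post, and it should behave the same as `my_pipeline` as far as I can tell.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC


def build_pipeline(C=None):
    """Hypothetical compact variant of my_pipeline."""
    pipe = Pipeline([
        ('step1', StandardScaler()),
        ('clf', LinearSVC(random_state=42)),
    ])
    if C is not None:
        # '<step name>__<parameter>' addresses a parameter of a pipeline step
        pipe.set_params(clf__C=C)
    return pipe


default_C = build_pipeline().get_params()['clf__C']
custom_C = build_pipeline(C=5).get_params()['clf__C']
print(default_C, custom_C)
```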
Next, I set a few values of C to test:
Cs = [0.01, 0.1, 1, 2, 5, 10, 50, 100]
Lastly, I would like to check the maximum recall_score that can be obtained. I compute it once using cross_val_score and once directly using GridSearchCV:
np.max(
    [
        np.mean(
            cross_val_score(my_pipeline(C=c), X, y,
                            cv=3,
                            scoring=make_scorer(recall_score)
            )) for c in Cs])
and:
GridSearchCV(
    my_pipeline(),
    {
        'clf__C': Cs
    },
    scoring=make_scorer(recall_score),
    cv=3
).fit(X, y).best_score_
In my example, the former yields 0.85997883750571147 and the latter 0.85999999999999999. I was expecting the value to be the same. What did I miss?
I put it all in a gist as well.
Edit: Fixing cv. I replaced cv=3 with StratifiedKFold(n_splits=3, random_state=42) and the results didn't change. As a matter of fact, it seems like cv doesn't influence the result.
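One way to narrow this down is to compare the per-fold scores from both APIs rather than only the final numbers. The sketch below assumes an unshuffled StratifiedKFold and uses a hypothetical helper, `pipeline_for_C`. If the per-fold scores agree, the gap can only come from how the folds are averaged, so it also prints a mean weighted by test-fold size for comparison. Older scikit-learn releases exposed an `iid` option on GridSearchCV that weighted fold scores this way; whether that applies here depends on the installed version, so treat it as a possibility rather than a confirmed explanation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
Cs = [0.01, 0.1, 1, 2, 5, 10, 50, 100]
scorer = make_scorer(recall_score)
cv = StratifiedKFold(n_splits=3)  # no shuffling, so the splits are deterministic


def pipeline_for_C(C):
    """Hypothetical helper: same steps as my_pipeline, with C always set."""
    return Pipeline([
        ('step1', StandardScaler()),
        ('clf', LinearSVC(C=C, random_state=42)),
    ])


# Per-fold scores from cross_val_score, one row per value of C
cvs_folds = np.array([
    cross_val_score(pipeline_for_C(c), X, y, cv=cv, scoring=scorer) for c in Cs
])

# Per-fold scores recorded by GridSearchCV, in the same row order as Cs
grid = GridSearchCV(pipeline_for_C(1.0), {'clf__C': Cs}, scoring=scorer, cv=cv).fit(X, y)
grid_folds = np.column_stack([
    grid.cv_results_['split%d_test_score' % i] for i in range(cv.get_n_splits())
])

# 1000 samples do not divide evenly into 3 folds, so the sizes differ
fold_sizes = np.array([len(test_idx) for _, test_idx in cv.split(X, y)])

print('per-fold scores match:', np.allclose(cvs_folds, grid_folds))
print('test fold sizes:', fold_sizes)
print('max plain mean:', cvs_folds.mean(axis=1).max())
print('max size-weighted mean:', np.average(cvs_folds, axis=1, weights=fold_sizes).max())
print('GridSearchCV best_score_:', grid.best_score_)
```

Comparing the last three printed values should show which kind of averaging, if either, reproduces best_score_.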
random_state in both GridSearchCV and cross_val_score? - Angus Williams