Getting proper cross validation scores with grid search and pipelines in sklearn

Question

My setup: I am running a pipeline in which I fit a regression after selecting the relevant variables (and after standardizing the data, a step I have omitted since it is irrelevant here). I optimize the pipeline through a grid search, as shown below:

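import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# 10 stratified random train/test splits, each holding out 20% of the data
fold = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=777)
regression_estimator = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10, solver='newton-cg')
pipeline_steps = [('feature_selection', SelectKBest(f_regression)), ('regression', regression_estimator)]

pipe = Pipeline(steps=pipeline_steps)

# Candidate numbers of features to keep: 1, 4, 7, ..., 31
feature_selection_k_options = np.arange(1, 33, 3)

param_grid = {'feature_selection__k': feature_selection_k_options}

# X, y are the feature matrix and target, defined elsewhere (not shown)
gs = GridSearchCV(pipe, param_grid=param_grid, scoring='recall', cv=fold)
gs.fit(X, y)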

Since refit=True by default in GridSearchCV, I get the best_estimator_ automatically, and I am fine with that. What I am missing is how, given this best_estimator_, I can get the cross-validated scores on only the TEST data that was split off during the procedure. There is a .score(X, y) method but, as the docs state (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.predict_proba), it "Returns the mean accuracy on the given test data and labels", whereas I want what cross_val_score does (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html). The problem is that cross_val_score re-runs everything and keeps only those results (I want to have all the quantities that come out of this process).

In essence, I want to extract, for the best estimator, the cross-validated score on the test data with a measure of my choosing (or the one already selected in the grid search) and with the cross-validation scheme already used in my setup (the StratifiedShuffleSplit in this case).

Do you know how to do it?

Please explain in more detail (probably with pseudocode of some kind) what you want to do. Currently it's very confusing. The best estimator is the estimator initialized with the best parameter combination found. All parameter combinations, with their train and test results on all folds, can be accessed from cv_results_. This can be done for any number of metrics you want. - Vivek Kumar

Marcus V. · Accepted Answer · 2018-06-06T18:58:02

You can access the cross-validation scores through the cv_results_ attribute, which can be read conveniently into a pandas DataFrame:

import pandas as pd
df_result = pd.DataFrame(gs.cv_results_)
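
To make this concrete, here is a minimal sketch of how the per-split test scores for the winning parameter setting might be pulled out of cv_results_. It is not the asker's exact setup: the data comes from make_classification as a stand-in for X and y, the grid and number of splits are smaller, and f_classif is swapped in for f_regression because the target is a class label. The column names (split0_test_score, mean_test_score, ...) and gs.best_index_ are the ones GridSearchCV provides.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the asker's X, y (assumption, not the original data).
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)

fold = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=777)
pipe = Pipeline([
    ("feature_selection", SelectKBest(f_classif)),  # swapped in for f_regression
    ("regression", LogisticRegression(max_iter=1000)),
])
gs = GridSearchCV(pipe, param_grid={"feature_selection__k": [2, 4, 8]},
                  scoring="recall", cv=fold)
gs.fit(X, y)

df_result = pd.DataFrame(gs.cv_results_)

# One column per split holds the held-out (test) score of every parameter setting.
split_cols = [f"split{i}_test_score" for i in range(fold.get_n_splits())]

# gs.best_index_ is the row of cv_results_ that produced best_estimator_.
best_row = df_result.loc[gs.best_index_]
best_split_scores = best_row[split_cols].to_numpy(dtype=float)

print(best_row["params"])
print(best_split_scores)                          # recall on each test split
print(best_split_scores.mean(), best_row["mean_test_score"])
```

Because these scores were computed during the search itself, nothing has to be re-run with cross_val_score; the mean of the split columns is the same number reported as gs.best_score_.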

Regarding "with a measure of my choosing", you can check out the scikit-learn example "Demonstration of multi-metric evaluation on cross_val_score and GridSearchCV", which shows how multiple scorers can be calculated at once within GridSearchCV.
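
As a rough illustration of the multi-scorer route, the sketch below passes a dict of scorers to GridSearchCV. The data, the small grid, and the choice of f_classif are assumptions made for the example rather than the asker's setup. With several scorers, refit has to name the one used to pick best_estimator_, and cv_results_ then carries one set of columns per scorer (mean_test_recall, split0_test_precision, and so on).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the asker's X, y (assumption, not the original data).
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)

fold = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=777)
pipe = Pipeline([
    ("feature_selection", SelectKBest(f_classif)),
    ("regression", LogisticRegression(max_iter=1000)),
])

# Scorer names (dict keys) become suffixes of the cv_results_ columns.
scoring = {"recall": "recall", "precision": "precision", "accuracy": "accuracy"}

gs_multi = GridSearchCV(pipe, param_grid={"feature_selection__k": [2, 4, 8]},
                        scoring=scoring, refit="recall", cv=fold)
gs_multi.fit(X, y)

df_multi = pd.DataFrame(gs_multi.cv_results_)

# Mean and standard deviation of every metric for the setting chosen by recall.
summary_cols = [f"{stat}_test_{name}" for name in scoring for stat in ("mean", "std")]
summary = df_multi.loc[gs_multi.best_index_, summary_cols]
print(gs_multi.best_params_)
print(summary)
```

Here the best setting is still chosen by recall, as in the question, while precision and accuracy on the same test splits come along at no extra fitting cost.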
