My setup: I am running a pipeline in which I fit a regression after selecting the relevant variables (and after standardizing the data, a step I have omitted since it is irrelevant here). I optimize the pipeline through a grid search, as shown below:
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline
fold = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=777)
regression_estimator = LogisticRegression(penalty='l2', random_state=777, max_iter=10000, tol=10, solver='newton-cg')
pipeline_steps = [('feature_selection', SelectKBest(f_regression)), ('regression', regression_estimator)]
pipe = Pipeline(steps=pipeline_steps)
feature_selection_k_options = np.arange(1, 33, 3)
param_grid = {'feature_selection__k': feature_selection_k_options}
gs = GridSearchCV(pipe, param_grid=param_grid, scoring='recall', cv=fold)
gs.fit(X, y)
Since refit=True by default in GridSearchCV, I get the best_estimator_ automatically, and I am fine with that. What I am missing is how, given this best_estimator_, I can get the cross-validated scores on only the TEST data that was split off in the procedure. There is a .score(X, y) method, but as the docs say, it "Returns the mean accuracy on the given test data and labels", whereas I want what cross_val_score does. The problem with cross_val_score is that it re-runs everything and keeps only the scores, whereas I want all the quantities that come out of the process.
In essence, I want to extract, from the best estimator, the cross-validated score on the test data, with a measure of my choosing (or the one already selected in the grid search) and with the cross-validation scheme already embedded in my pipeline (the StratifiedShuffleSplit in this case).
Do you know how to do it?
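To make the target concrete, here is a minimal sketch of the kind of output I am after. It assumes that the per-split test scores GridSearchCV stores in cv_results_ (indexed by best_index_) are the quantities I want. The data is a synthetic stand-in from make_classification, and the LogisticRegression settings are simplified; both are assumptions for illustration, not my real setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption: the real X, y are not shown in the post).
X, y = make_classification(n_samples=300, n_features=32, n_informative=8,
                           random_state=777)

fold = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=777)
pipe = Pipeline(steps=[
    ('feature_selection', SelectKBest(f_regression)),
    # Simplified solver settings (assumption), not the ones from the post.
    ('regression', LogisticRegression(max_iter=10000)),
])
param_grid = {'feature_selection__k': np.arange(1, 33, 3)}

gs = GridSearchCV(pipe, param_grid=param_grid, scoring='recall', cv=fold)
gs.fit(X, y)

# Row of cv_results_ that corresponds to the best parameter setting.
best = gs.best_index_

# Recall on the held-out part of each of the 10 splits, for the best setting.
split_scores = np.array([
    gs.cv_results_[f'split{i}_test_score'][best]
    for i in range(gs.n_splits_)
])

print(gs.best_params_)
print(split_scores)
# The mean of the per-split scores should match best_score_.
print(split_scores.mean(), gs.best_score_)
```

These scores come from the models fitted on each training split during the search, not from the refitted best_estimator_ itself, so I am not sure this is exactly equivalent to what I am asking for.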
