GridSearchCV on a working pipeline returns ValueError

Question

I am using GridSearchCV in order to find the best parameters for my pipeline.

My pipeline seems to work well as I can apply:

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

And I get a decent result.

But GridSearchCV obviously doesn't like something, and I cannot figure it out.

My pipeline:

feats = FeatureUnion([('age', age),
                      ('education_num', education_num),
                      ('is_education_favo', is_education_favo),
                      ('is_marital_status_favo', is_marital_status_favo),
                      ('hours_per_week', hours_per_week),
                      ('capital_diff', capital_diff),
                      ('sex', sex),
                      ('race', race),
                      ('native_country', native_country)
                     ])

pipeline = Pipeline([
        ('adhocFC',AdHocFeaturesCreation()),
        ('imputers', KnnImputer(target = 'native-country', n_neighbors = 5)),
        ('features',feats),('clf',LogisticRegression())])

My GridSearch:

hyperparameters = {'imputers__n_neighbors' : [5,21,41], 'clf__C' : [1.0, 2.0]}

GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring = 'roc_auc' , refit = False) #change n_jobs = 2, refit = False

GSCV.fit(X_train, y_train)

I receive 11 similar warnings:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/main.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

and this is the error message:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/main.py:11: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy /home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/main.py:12: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy /home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/ipykernel/main.py:14: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

--------------------------------------------------------------------------- ValueError Traceback (most recent call last) in () 3 GSCV = GridSearchCV(pipeline, hyperparameters, cv=3, scoring = 'roc_auc' ,refit = False) #change n_jobs = 2, refit = False 4 ----> 5 GSCV.fit(X_train, y_train)

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups) 943 train/test set. 944 """ --> 945 return self._fit(X, y, groups, ParameterGrid(self.param_grid)) 946 947

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_search.py in _fit(self, X, y, groups, parameter_iterable) 562 return_times=True, return_parameters=True, 563 error_score=self.error_score) --> 564 for parameters in parameter_iterable 565 for train, test in cv_iter) 566

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in call(self, iterable) 756 # was dispatched. In particular this covers the edge 757 # case of Parallel used with an exhausted iterator. --> 758 while self.dispatch_one_batch(iterator): 759 self._iterating = True 760 else:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in dispatch_one_batch(self, iterator) 606 return False 607 else: --> 608 self._dispatch(tasks) 609 return True 610

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in _dispatch(self, batch) 569 dispatch_timestamp = time.time() 570 cb = BatchCompletionCallBack(dispatch_timestamp, len(batch), self) --> 571 job = self._backend.apply_async(batch, callback=cb) 572 self._jobs.append(job) 573

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in apply_async(self, func, callback) 107 def apply_async(self, func, callback=None): 108 """Schedule a func to be run""" --> 109 result = ImmediateResult(func) 110 if callback: 111 callback(result)

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/_parallel_backends.py in init(self, batch) 324 # Don't delay the application, to avoid keeping the input 325 # arguments in memory --> 326 self.results = batch() 327 328 def get(self):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in call(self) 129 130 def call(self): --> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items] 132 133 def len(self):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/externals/joblib/parallel.py in (.0) 129 130 def call(self): --> 131 return [func(*args, **kwargs) for func, args, kwargs in self.items] 132 133 def len(self):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, error_score) 236 estimator.fit(X_train, **fit_params) 237 else: --> 238 estimator.fit(X_train, y_train, **fit_params) 239 240 except Exception as e:

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/pipeline.py in fit(self, X, y, **fit_params) 266 This estimator 267 """ --> 268 Xt, fit_params = self._fit(X, y, **fit_params) 269 if self._final_estimator is not None: 270 self._final_estimator.fit(Xt, y, **fit_params)

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/pipeline.py in _fit(self, X, y, **fit_params) 232 pass 233 elif hasattr(transform, "fit_transform"): --> 234 Xt = transform.fit_transform(Xt, y, **fit_params_steps[name]) 235 else: 236 Xt = transform.fit(Xt, y, **fit_params_steps[name]) \

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/sklearn/base.py in fit_transform(self, X, y, **fit_params) 495 else: 496 # fit method of arity 2 (supervised transformation) --> 497 return self.fit(X, y, **fit_params).transform(X) 498 499

in fit(self, X, y) 16 self.ohe.fit(X_full) 17 #Create a Dataframe that does not contain any nulls, categ variables are OHE, with all each rows ---> 18 X_ohe_full = self.ohe.transform(X_full[~X[self.col].isnull()].drop(self.col, axis=1)) 19 20 #Fit the classifier on lines where col is null

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/frame.py in getitem(self, key) 2057 return self._getitem_multilevel(key) 2058 else: -> 2059 return self._getitem_column(key) 2060 2061 def _getitem_column(self, key):

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/frame.py in _getitem_column(self, key) 2064 # get column 2065
if self.columns.is_unique: -> 2066 return self._get_item_cache(key) 2067 2068 # duplicate columns & possible reduce dimensionality

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/generic.py in _get_item_cache(self, item) 1384 res = cache.get(item)
1385 if res is None: -> 1386 values = self._data.get(item) 1387 res = self._box_item_values(item, values) 1388
cache[item] = res

/home/jo/anaconda2/envs/py35/lib/python3.5/site-packages/pandas/core/internals.py in get(self, item, fastpath) 3550 loc = indexer.item() 3551 else: -> 3552 raise ValueError("cannot label index with a null key") 3553 3554 return self.iget(loc, fastpath=fastpath)

ValueError: cannot label index with a null key

Your definition of hyperparameters appears to be fine. Your instantiation of GridSearchCV looks correct. It seems like the problem could be connected to your data. How did you create X_train, X_test and y_train? Could you please post your full code for creating/importing the data and creating these 3 variables? That might help offer some clues about the problem. — edesz

FChm FChm · Accepted Answer · 2019-04-17T21:00:48

Without additional information I believe it is because your X_train and y_train variables are pandas dataframe, the basic sci-kit learn library isn't comparable with these: e.g., the .fit method of a classifier is expecting an array like object.

By feeding in pandas dataframes you are inadvertently indexing them like numpy arrays, which is not that stable in pandas.

Try converting your training data to numpy arrays:

X_train_arr = X_train.to_numpy()
y_train_arr = y_train.to_numpy()

GridSearchCV on a working pipeline returns ValueError

1 Answers