6
votes

I have a simple sklearn class I would like to use as part of an sklearn pipeline. This class just takes a pandas dataframe X_DF and a categorical column name, and calls pd.get_dummies to return the dataframe with the column turned into a matrix of dummy variables...

import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class dummy_var_encoder(TransformerMixin, BaseEstimator):
    '''Convert selected categorical column to (set of) dummy variables    
    '''


    def __init__(self, column_to_dummy='default_col_name'):
        self.column = column_to_dummy
        print(self.column)

    def fit(self, X_DF, y=None):
        return self 

    def transform(self, X_DF):
        ''' Update X_DF to have set of dummy-variables instead of orig column'''        

        # convert self-attribute to local var for ease of stepping through function
        column = self.column

        # build dummy columns for the selected categorical column
        dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)

        # keep the original column alongside its dummies (matches the output shown below)
        new_DF = pd.concat([X_DF[column], dummy_matrix], axis=1)

        return new_DF

Now using this transformer on its own to fit/transform, I get output as expected. For some toy data as below:

from sklearn import datasets
# Load toy data 
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns = iris.feature_names)
y = pd.Series(iris.target, name='y')

# Create arbitrary categorical features
X['category_1'] = pd.cut(X['sepal length (cm)'], 
                         bins=3, 
                         labels=['small', 'medium', 'large'])

X['category_2'] = pd.cut(X['sepal width (cm)'], 
                         bins=3, 
                         labels=['small', 'medium', 'large'])

...my dummy encoder produces the correct output:

encoder = dummy_var_encoder(column_to_dummy = 'category_1')
encoder.fit(X)
encoder.transform(X).iloc[15:21,:]

category_1
   category_1  category_1_small  category_1_medium  category_1_large
15     medium                 0                  1                 0
16      small                 1                  0                 0
17      small                 1                  0                 0
18     medium                 0                  1                 0
19      small                 1                  0                 0
20      small                 1                  0                 0

However, when I call the same transformer from an sklearn pipeline as defined below:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold, GridSearchCV

# Define Pipeline
clf = LogisticRegression(penalty='l1')
pipeline_steps = [('dummy_vars', dummy_var_encoder()),
                  ('clf', clf)
                  ]

pipeline = Pipeline(pipeline_steps)

# Define hyperparameters to try for the dummy encoder and the classifier
# 4 candidates: dummy category_1 vs category_2, crossed with l1 vs l2 penalty in the log-reg
param_grid = {'dummy_vars__column_to_dummy': ['category_1', 'category_2'],
              'clf__penalty': ['l1', 'l2']}

# Define full model search process
cv_model_search = GridSearchCV(pipeline,
                               param_grid,
                               scoring='accuracy',
                               cv=KFold(),
                               refit=True,
                               verbose=3)

All's well until I fit the pipeline, at which point I get an error from the dummy encoder:

In [101]: cv_model_search.fit(X,y=y)
Fitting 3 folds for each of 4 candidates, totalling 12 fits
None None None None
[CV] dummy_vars__column_to_dummy=category_1, clf__penalty=l1 .........

Traceback (most recent call last):
  File "", line 1, in <module>
    cv_model_search.fit(X,y=y)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_search.py", line 638, in fit
    cv.split(X, y, groups)))
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/model_selection/_validation.py", line 437, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 257, in fit
    Xt, fit_params = self._fit(X, y, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 222, in _fit
    **fit_params_steps[name])
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/externals/joblib/memory.py", line 362, in __call__
    return self.func(*args, **kwargs)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/pipeline.py", line 589, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/sklearn/base.py", line 521, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "", line 21, in transform
    dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1964, in __getitem__
    return self._getitem_column(key)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/frame.py", line 1971, in _getitem_column
    return self._get_item_cache(key)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/generic.py", line 1645, in _get_item_cache
    values = self._data.get(item)
  File "/home/max/anaconda3/envs/remine/lib/python2.7/site-packages/pandas/core/internals.py", line 3599, in get
    raise ValueError("cannot label index with a null key")

ValueError: cannot label index with a null key

Comment (Vivek Kumar): Yes. That's because, inside a pipeline (most probably due to GridSearchCV), the type of X is changed from DataFrame to numpy array, which doesn't have any index or columns with it. Hence doing this will give errors.

Comment (Max Power): Thanks Vivek. I've definitely used sklearn pipelines with custom transformers that accept/return a pandas DataFrame before; I'm still trying to figure out why my old one worked and this seemingly minimal example doesn't. I think you're probably right about GridSearchCV being the issue; I used a custom cv iterator on my last project...

1 Answer

4
votes

The traceback is telling you exactly what went wrong. Learning to diagnose a traceback is invaluable, especially when you are inheriting from libraries you might not completely understand.

Now, I have done a fair bit of inheriting in sklearn myself, and I can tell you without a doubt that GridSearchCV is going to give you some trouble if the data passed to your fit or fit_transform methods is not a NumPy array. As Vivek mentioned in his comment, the X getting passed to your fit method is no longer a DataFrame. But let's take a look at the traceback first.

ValueError: cannot label index with a null key

While Vivek is correct about the NumPy array, you have another problem here. The actual error you get is that the value of column in your transform method is None (the traceback shows the failure at line 21, in transform). If you were to look at your encoder object above, you would see the __repr__ method outputs the following:

dummy_var_encoder(column_to_dummy=None)

This happens because of how sklearn clones estimators inside Pipeline and GridSearchCV: get_params() reads the parameter names from your __init__ signature (here, column_to_dummy) and looks up instance attributes with exactly those names. Since you stored the value as self.column, no column_to_dummy attribute exists, get_params() reports None, and the clone gets rebuilt with column_to_dummy=None. This behavior can be seen throughout the cross-validation and search methods, and giving an attribute a different name from its __init__ parameter causes issues like this. Fixing this will start you down the right path.
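You can see this directly, as a minimal sketch using the class exactly as defined in the question. Note that the sklearn version in the traceback has get_params() silently fall back to None for the missing attribute, while newer versions raise an AttributeError instead:

from sklearn.base import clone

enc = dummy_var_encoder(column_to_dummy='category_1')

# get_params() looks for an attribute literally named 'column_to_dummy';
# the value was stored as self.column, so it comes back as None here
print(enc.get_params())   # {'column_to_dummy': None}

# clone() rebuilds the estimator from get_params(), losing the column name
print(clone(enc))         # dummy_var_encoder(column_to_dummy=None)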

Modifying the __init__ method as follows will solve this specific issue:

def __init__(self, column='default_col_name'):
    self.column = column
    print(self.column)
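One knock-on effect worth noting: once the parameter is renamed, the key in the search grid has to match it, so the grid from the question would become:

param_grid = {'dummy_vars__column': ['category_1', 'category_2'],
              'clf__penalty': ['l1', 'l2']}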

However, once you have done this, the issue Vivek mentioned will rear its head, and you will have to deal with that. This is something I have run into before, though not with DataFrames specifically. I came up with a solution in Use sklearn GridSearchCV on custom class whose fit method takes 3 arguments. Basically, I created a wrapper that implements the __getitem__ method in a way that makes the data look and behave in a way that will pass the validation methods used in GridSearchCV, Pipeline, and other cross-validation methods.
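To give a flavor of that idea (a hypothetical, simplified sketch, not the exact code from that answer): wrap the DataFrame in an object that supports len() and integer indexing, so sklearn's indexing utilities can slice it while the DataFrame stays intact inside:

class DataFrameWrapper(object):
    '''Hypothetical sketch: expose just enough of the sequence protocol
    that sklearn's CV indexing can slice the data without converting it.'''

    def __init__(self, df):
        self.df = df

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # CV splitters hand over integer positions; .iloc honors them
        return DataFrameWrapper(self.df.iloc[idx])

A custom transformer would then unwrap .df at the top of its fit/transform methods.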

Edit

I made these changes, and it looks like your problem then comes from the validation method check_array. While calling this method with dtype=pd.DataFrame would work, the linear model calls it with dtype=np.float64, throwing an error. To get around this, instead of concatenating the original data with your dummies, you can just return the dummy columns and fit using those. This is something that should be done anyway, since you wouldn't want to include both the dummy columns and the original categorical column in the model you are trying to fit. You may also consider the drop_first option, but I'm getting off subject. So, changing your transform method like so allows the whole process to work as expected.

def transform(self, X_DF):
    '''Return the dummy-variable columns for the selected categorical column'''

    # convert self-attribute to local var for ease of stepping through function
    column = self.column

    # build the dummy columns; return only these, not the original data
    dummy_matrix = pd.get_dummies(X_DF[column], prefix=column)

    return dummy_matrix
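Putting both fixes together, here is a sketch of the full transformer and search, reusing the imports, toy data, and step names from the question (nothing else assumed):

class dummy_var_encoder(TransformerMixin, BaseEstimator):
    '''Convert selected categorical column to (set of) dummy variables'''

    def __init__(self, column='default_col_name'):
        # stored under the same name as the __init__ parameter, per sklearn convention
        self.column = column

    def fit(self, X_DF, y=None):
        return self

    def transform(self, X_DF):
        # return only the dummy columns; the classifier fits on these
        return pd.get_dummies(X_DF[self.column], prefix=self.column)

pipeline = Pipeline([('dummy_vars', dummy_var_encoder()),
                     ('clf', LogisticRegression(penalty='l1'))])

param_grid = {'dummy_vars__column': ['category_1', 'category_2'],
              'clf__penalty': ['l1', 'l2']}

cv_model_search = GridSearchCV(pipeline, param_grid,
                               scoring='accuracy', cv=KFold(), refit=True)
cv_model_search.fit(X, y)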