1
votes

Can someone point me to an existing Python categorical encoder class for sklearn that ticks the following boxes?

  1. pandas friendly: option to return a DataFrame
  2. able to drop one column per feature in one-hot encoding
  3. handles unseen categories in test data
  4. compatible with the sklearn Pipeline object
1
Such a thing does not exist natively in pandas or sklearn. However, with a little coding, you can wrap OneHotEncoder to do what you want. - gmds
Yes, I couldn't find anything along these lines. - solver149

1 Answer

0
votes

I think you're looking for pandas.get_dummies.

See the following example.

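import pandas as pd

df = pd.DataFrame({"col_a": ['cat', 'dog', 'cat', 'mouse', 'mouse', 'cat'], 'col_b': [10, 14, 16, 18, 20, 22], 'col_c': ['a', 'a', 'a', 'b', 'b', 'a']})

# drop_first=True drops the first level of each encoded column (here 'cat' and 'a').
# dtype=int keeps the 0/1 output on pandas >= 2.0, where the default dtype is bool.
df = pd.get_dummies(df, columns=['col_a', 'col_c'], drop_first=True, dtype=int)
print(df)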

Output:

   col_b  col_a_dog  col_a_mouse  col_c_b
0     10          0            0        0
1     14          1            0        0
2     16          0            0        0
3     18          0            1        1
4     20          0            1        1
5     22          0            0        0

This covers the first two requirements you listed.

For the third requirement, you can do the following:

  • create the dummies on the training data
    dummy_train = pd.get_dummies(train)
  • create the dummies on the new (unseen) data
    dummy_new = pd.get_dummies(new_data)
  • re-index the new data to the columns of the training data, filling the missing values with 0 (reindex returns a new DataFrame, so assign the result)
    dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
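
The steps above can be sketched end to end as follows. The `train` and `new_data` frames here are made-up toy data, not taken from the question, and `dtype=int` is only there so the result shows 0/1 on recent pandas versions.

```python
import pandas as pd

# Toy data (hypothetical): 'bird' appears only in the new data, 'mouse' only in training.
train = pd.DataFrame({"animal": ["cat", "dog", "mouse"], "size": [10, 14, 18]})
new_data = pd.DataFrame({"animal": ["dog", "bird"], "size": [12, 5]})

dummy_train = pd.get_dummies(train, columns=["animal"], dtype=int)
dummy_new = pd.get_dummies(new_data, columns=["animal"], dtype=int)

# Align to the training columns: the unseen 'animal_bird' column is dropped,
# and training columns missing from the new data are added and filled with 0.
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
print(dummy_new)
```

After the reindex, `dummy_new` has exactly the columns of `dummy_train`, so the 'bird' row ends up with zeros in every `animal_*` column.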

In effect, any category that appears only in the new data is dropped before the data reaches the classifier. I think that should not cause problems, since the classifier would not know what to do with a feature it never saw during training.
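
For the fourth requirement (Pipeline compatibility), one possible approach is to wrap the same get_dummies plus reindex idea in a small transformer. This is only a rough sketch: `DummyEncoder` is a hypothetical name, not an existing pandas or sklearn class, and the toy data and the choice of LogisticRegression are assumptions for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class DummyEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: get_dummies, then reindex to the columns seen in fit."""

    def __init__(self, columns=None, drop_first=False):
        self.columns = columns
        self.drop_first = drop_first

    def fit(self, X, y=None):
        # Remember which dummy columns the training data produces.
        dummies = pd.get_dummies(X, columns=self.columns,
                                 drop_first=self.drop_first, dtype=int)
        self.columns_ = dummies.columns
        return self

    def transform(self, X):
        # Encode every level here (no drop_first), then keep only the training
        # columns. Applying drop_first to new data could drop a different level.
        dummies = pd.get_dummies(X, columns=self.columns, dtype=int)
        return dummies.reindex(columns=self.columns_, fill_value=0)


# Toy data (hypothetical): 'bird' never appears in training.
train = pd.DataFrame({"animal": ["cat", "dog", "mouse", "cat"], "size": [10, 14, 18, 11]})
y = [0, 1, 1, 0]
new = pd.DataFrame({"animal": ["mouse", "bird"], "size": [17, 5]})

encoded_new = DummyEncoder(columns=["animal"], drop_first=True).fit(train).transform(new)
print(encoded_new)

pipe = Pipeline([
    ("encode", DummyEncoder(columns=["animal"], drop_first=True)),
    ("model", LogisticRegression()),
])
pipe.fit(train, y)
predictions = pipe.predict(new)
```

Note that with drop_first=True an unseen category is encoded as all zeros, which is the same encoding as the dropped baseline level ('cat' here), so the model cannot tell the two apart.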