Handling categorical variables in sklearn with one-hot encoding

Question

Can someone point me to an existing Python class for a categorical encoder for sklearn that ticks the following boxes?

pandas friendly - option to return a DataFrame
able to drop one column in one-hot encoding
handles unseen categories in test data
compatible with the sklearn Pipeline object

Such a thing does not exist natively in pandas or sklearn. However, with a little coding, you can wrap OneHotEncoder to do what you want. — gmds

Sociopath · Accepted Answer · 2019-03-22T04:35:09

I think you're looking for pandas.get_dummies.

See the following example.

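import pandas as pd

df = pd.DataFrame({"col_a": ['cat', 'dog', 'cat', 'mouse', 'mouse', 'cat'],
                   "col_b": [10, 14, 16, 18, 20, 22],
                   "col_c": ['a', 'a', 'a', 'b', 'b', 'a']})

# drop_first=True drops the first level of each encoded column ('cat' and 'a' here).
# dtype=int keeps 0/1 values on pandas 2.0+, where the default dummy dtype is bool.
df = pd.get_dummies(df, columns=['col_a', 'col_c'], drop_first=True, dtype=int)
print(df)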

Output:

   col_b  col_a_dog  col_a_mouse  col_c_b
0     10          0            0        0
1     14          1            0        0
2     16          0            0        0
3     18          0            1        1
4     20          0            1        1
5     22          0            0        0

It covers the first two conditions that you mentioned.

For the third condition (unseen categories in test data), you can do the following.

Create the dummies on the training data:
dummy_train = pd.get_dummies(train)
Create the dummies on the new (unseen) data:
dummy_new = pd.get_dummies(new_data)
Re-index the new data to the columns of the training data, filling any missing columns with 0 (reindex returns a new DataFrame, so assign the result):
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

Effectively, any category that appears only in the new data is dropped and never reaches the classifier. I think that should not cause problems, since the classifier would not know what to do with a category it was never trained on.
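
To make the steps above concrete, here is a minimal sketch with made-up data. The frames and the column names animal and size are hypothetical, not from the question. It assumes drop_first=True is applied to the training data only; the new data is encoded in full and then aligned to the training columns with reindex. I pass dtype=int so the dummy columns have the same type as the integer fill_value.

```python
import pandas as pd

# Hypothetical data: 'bird' never appears in training, 'mouse' never appears in new_data.
train = pd.DataFrame({"animal": ["cat", "dog", "mouse"], "size": [1, 2, 3]})
new_data = pd.DataFrame({"animal": ["dog", "bird"], "size": [4, 5]})

# Encode the training data, dropping the first level ('cat') as the baseline.
dummy_train = pd.get_dummies(train, columns=["animal"], drop_first=True, dtype=int)

# Encode the new data in full; using drop_first here would drop a different level.
dummy_new = pd.get_dummies(new_data, columns=["animal"], dtype=int)

# Align to the training columns: unseen 'animal_bird' is removed,
# missing 'animal_mouse' is added and filled with 0.
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

print(list(dummy_new.columns))
print(dummy_new)
```

One thing to keep in mind: the 'bird' row ends up with zeros in every dummy column, which is the same encoding as the dropped baseline 'cat'. So when drop_first is used, the model cannot tell an unseen category apart from the baseline.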
