Categorical Feature Encoding as Enum for Scikit-Learn

Question

I am trying to preprocess a very large dataset with many categorical features for scikit-learn's RandomForest model (regression). The nature of the categorical data requires that the encoding scheme not introduce any ordinality. The H2O ML framework offers enum encoding, which would suit my data perfectly. However, I rely on scikit-learn's RandomForest.

Is anyone aware of an enum encoding for scikit-learn models? (One-hot encoding is not an option.)

Thanks in advance!

I'm not sure about enum encoding (I have never heard of it, actually), but see github.com/scikit-learn-contrib/categorical-encoding or github.com/dirty-cat/dirty_cat. As mentioned, CatBoost also has many basic-to-advanced built-in encoding methods. — TwinPenguins

Mischa Lisovyi · Accepted Answer · 2018-06-14T09:46:06

Apart from one-hot encoding (OHE), sklearn only offers label encoding via LabelEncoder. However, that does not provide the functionality you want: categories are simply encoded as integers, which I believe is meaningful only for ordinal categories. As far as I know, sklearn leaves it up to individual models to implement such enum-style category treatment, because sklearn has many models and most of them would not benefit from such an encoding.
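
To illustrate why integer codes are a poor fit for nominal categories, here is a minimal pure-Python sketch. It is not sklearn code: the colour values and the helper functions `label_encode`, `threshold_splits` and `subset_splits` are made up for illustration. It assigns integers in sorted order, as LabelEncoder does, and then compares the groupings a single numeric threshold split can produce with those available when the feature is treated as an unordered set of categories.

```python
from itertools import combinations


def label_encode(values):
    """Map each distinct value to an integer in sorted order."""
    classes = sorted(set(values))
    return {c: i for i, c in enumerate(classes)}


def threshold_splits(mapping):
    """Left-hand groups reachable by one 'code <= t' split on the integer codes."""
    ordered = sorted(mapping, key=mapping.get)
    return [frozenset(ordered[:k]) for k in range(1, len(ordered))]


def subset_splits(categories):
    """Left-hand groups reachable when any subset of categories may go left."""
    cats = sorted(set(categories))
    anchor, rest = cats[0], cats[1:]
    groups = []
    # Keep the first category on the left so mirror-image splits are counted once.
    for r in range(len(rest) + 1):
        for combo in combinations(rest, r):
            left = frozenset((anchor,) + combo)
            if len(left) < len(cats):  # skip the trivial "everything left" case
                groups.append(left)
    return groups


colors = ["red", "green", "blue", "yellow"]
codes = label_encode(colors)
ordered = threshold_splits(codes)
unordered = subset_splits(colors)

print(codes)
print(len(ordered), "threshold splits vs", len(unordered), "subset splits")
```

With four categories, a threshold on the codes can only separate contiguous ranges of the arbitrary (alphabetical) order, while an unordered treatment can also try groupings such as {blue, red} versus {green, yellow}. A deeper tree can still isolate such a grouping with several consecutive splits, so integer codes are not unusable, only less direct.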

I think LightGBM claims to implement this kind of category treatment internally, but I'm actually not 100% sure that is true. The advantage is that it has both RF and GBM tree builders, so you can easily switch between them, and it is faster than the sklearn implementation.

Note also that CatBoost has a rich toolkit for internal category encoding, but I have zero experience with it so far.
