How to encode a categorical feature with many levels in scikit-learn?

Question

I have a large data set (60k samples with 50 features). One of these features, which is really relevant for me, is the job name. There are many job names that I'd like to encode so that I can fit models such as linear regression or SVCs. However, I don't know how to handle them.

I tried pandas dummy variables and scikit-learn's one-hot encoding, but they generate many features that I may not encounter in the test set. I also tried scikit-learn's LabelEncoder(), but I got errors while encoding the variables, for example a float() > str() error.

What would you recommend for handling these categorical features? Thank you all.

FYR: You cannot use categorical variables as features in linear regression. Probably not in SVC, either. — DYZ
"it generate many features that I may not be encounter on test set". But they can be in real world data set, and that is what you are training for, I hope. — Vivek Kumar

emmanuelsa · Accepted Answer · 2017-05-23T01:51:44

There are a number of ways to achieve what you want. I personally find HashingVectorizer to be robust. You may want to try it, especially if you have many (and possibly sparse) features. An alternative is DictVectorizer.
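As a rough sketch of the hashing approach, assuming each job name should count as a single category rather than be split into words: the job titles below are made up, and `whole_title` and the `n_features=2**10` setting are illustrative choices of mine, not something taken from the question. Because hashing is stateless, a title that only shows up in the test set still maps to a valid column and the feature count stays fixed.

```python
from sklearn.feature_extraction.text import HashingVectorizer


def whole_title(title):
    """Hypothetical analyzer: treat the full job title as one token."""
    return [title.strip().lower()]


# Made-up job titles standing in for the real column.
train_jobs = ["Data Scientist", "Software Engineer", "Nurse"]
test_jobs = ["software engineer", "Astronaut"]  # "Astronaut" is unseen in training

# n_features fixes the output width up front; alternate_sign=False and
# norm=None keep each row as a plain 0/1 indicator for its hashed title.
hasher = HashingVectorizer(
    n_features=2**10,
    analyzer=whole_title,
    norm=None,
    alternate_sign=False,
)

# No fit step is needed: the hash function does not depend on the data.
X_train = hasher.transform(train_jobs)
X_test = hasher.transform(test_jobs)

print(X_train.shape, X_test.shape)
```

The resulting sparse matrices can be passed straight to estimators that accept sparse input, such as LinearSVC. The trade-off is that two different titles can collide in the same column, and the columns cannot be mapped back to names. For non-text features, sklearn.feature_extraction.FeatureHasher applies the same idea to dicts or lists of strings.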

Take a look at the examples here: http://scikit-learn.org/stable/modules/feature_extraction.html and http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html. You can easily modify them to serve your purpose.
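For the DictVectorizer alternative, a minimal sketch along the lines of the feature extraction guide might look like this. The rows and the `job` / `years` field names are hypothetical. String values are one-hot encoded as `name=value` columns, numeric values pass through unchanged, and a category that was not seen during `fit` is silently ignored at `transform` time, so the test matrix keeps the same columns as the training matrix.

```python
from sklearn.feature_extraction import DictVectorizer

# Hypothetical records; in practice something like df.to_dict("records").
train_rows = [
    {"job": "data scientist", "years": 3},
    {"job": "software engineer", "years": 5},
    {"job": "nurse", "years": 10},
]
test_rows = [{"job": "astronaut", "years": 7}]  # job unseen in training

vec = DictVectorizer(sparse=True)
X_train = vec.fit_transform(train_rows)  # learns one column per job seen
X_test = vec.transform(test_rows)        # unseen "job=astronaut" is dropped

print(vec.feature_names_)
print(X_test.toarray())
```

Unlike hashing, the learned columns stay interpretable, but the number of columns grows with the number of distinct job names in the training data.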
