0
votes

I have a large data set (60k samples with 50 features). One of these features, which is really relevant for me, is the job name. There are many job names that I'd like to encode so they can be fed to models such as linear regression or SVC. However, I don't know how to handle them.

I tried pandas dummy variables and scikit-learn's one-hot encoding, but they generate many features that I may not encounter in the test set. I also tried scikit-learn's LabelEncoder(), but I got errors while encoding the variables, for example a float() > str() error.

What would you recommend for handling these categorical features? Thank you all.

2
FYR: You cannot use categorical variables as features in linear regression. Probably not in SVC, either. - DYZ
"it generate many features that I may not be encounter on test set". But they can be in real world data set, and that is what you are training for, I hope. - Vivek Kumar
@DYZ yes, I know. That's the purpose of this question :) - Paulo Henrique Vasconcellos

2 Answers

1
votes

There are a number of ways to achieve what you want. I personally find HashingVectorizer to be robust. You may want to try it, especially if you have many (and possibly sparse) features. An alternative is DictVectorizer.

Take a look at the examples here: http://scikit-learn.org/stable/modules/feature_extraction.html and http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html

You can easily modify them to serve your purpose.
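As a rough sketch of the two vectorizers mentioned above, assuming the job names live in a column called "job" (the column name and the toy data are made up for illustration), something like this should work. Casting to str first also avoids the float/str mix that missing values tend to cause.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical toy data; the column name "job" is an assumption.
train = pd.DataFrame({"job": ["data scientist", "nurse", None, "software engineer"]})
test = pd.DataFrame({"job": ["nurse", "astronaut"]})  # "astronaut" never appears in train

# Replace missing values and cast to str so NaN (a float) is not mixed with strings.
train_jobs = train["job"].fillna("missing").astype(str)
test_jobs = test["job"].fillna("missing").astype(str)

# HashingVectorizer is stateless: the output width is fixed by n_features,
# so unseen job names in the test set still map to an existing column.
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
X_train_hashed = hasher.transform(train_jobs)
X_test_hashed = hasher.transform(test_jobs)

# DictVectorizer one-hot encodes string values it saw during fit
# and silently ignores unseen values at transform time.
dv = DictVectorizer(sparse=True)
X_train_dict = dv.fit_transform([{"job": j} for j in train_jobs])
X_test_dict = dv.transform([{"job": j} for j in test_jobs])

print(X_train_hashed.shape, X_test_hashed.shape)
print(X_train_dict.shape, X_test_dict.shape)
```

With hashing, different job names can collide in the same column, so n_features is a trade-off between memory and collisions. The resulting sparse matrices can be stacked with the other features (for example with scipy.sparse.hstack) before fitting the model.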

0
votes

Another solution is to do a bivariate analysis of the categorical variable against the target variable. The result shows how each level affects the target. Once you have this, you can combine the levels that have a similar effect on the target. This reduces the number of levels, and each remaining level is more likely to have a significant impact.
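A minimal sketch of that idea, assuming a numeric target and hypothetical column names "job" and "target" (the toy data and the choice of three bins are arbitrary): summarize the target per level, then merge levels whose mean target falls in the same bin.

```python
import pandas as pd

# Hypothetical toy data; "job" and "target" are assumed column names.
df = pd.DataFrame({
    "job": ["nurse", "nurse", "doctor", "doctor", "clerk", "clerk", "cashier", "cashier"],
    "target": [58.0, 62.0, 98.0, 102.0, 30.0, 34.0, 31.0, 29.0],
})

# Bivariate summary: mean target and number of samples for each level.
summary = df.groupby("job")["target"].agg(["mean", "count"])

# Merge levels with a similar mean target by binning the means.
# Three equal-width bins is an arbitrary choice for illustration.
summary["group"] = pd.cut(summary["mean"], bins=3, labels=False)

# Replace each job name by the merged group it belongs to.
df["job_group"] = df["job"].map(summary["group"])

print(summary)
print(df)
```

Here "clerk" and "cashier" end up in the same group because their mean targets are close. To avoid leaking target information, the summary should probably be computed on the training set only and then applied to the test set; levels that appear only in the test set would map to NaN and need a fallback group.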