0
votes

I have a categorical column of state names, and I'm unsure which type of categorical encoding to use to convert them to a numeric type.

There are 83 unique State Names.

A label encoder is meant for ordinal categorical variables, and one-hot encoding would add 83 columns since there are 83 unique state names.

Is there anything else I can try?

2 Answers

2
votes

I would use scikit-learn's OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), or CategoricalEncoder with encoding set to 'onehot'. It automatically finds the unique values of each feature and encodes them as one-hot vectors. It does increase the input dimensionality for that feature, but that is usually necessary: if you convert the feature to a single ordinal integer instead of a vector of binary values, an algorithm may infer a spurious relationship between two (possibly completely unrelated) categories that just happen to be assigned nearby integers.

2
votes

There are other powerful encoding schemes besides one-hot that do not increase the number of columns. You can try the following, in increasing order of complexity:

  • count encoding: encode each category by the number of times it occurs in the data. This is useful in some cases: for example, if you want to encode the fact that New York is a big city, the count of NY in the data really carries that information, since we expect NY to occur frequently.

  • target encoding: encode each category by the average value of the target/outcome within that category (if the target is continuous), or by the probability of the target if it is discrete. For example, when encoding a neighborhood feature, which is obviously important for predicting house prices, you can replace each neighborhood name by the average house price in that neighborhood. This improves prediction dramatically (as shown in my Kaggle notebook for predicting house prices).
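Both schemes above can be sketched in a few lines of pandas; the neighborhood and price data here are made up for illustration:

```python
import pandas as pd

# Hypothetical data: a categorical column and a continuous target
df = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "B", "C"],
    "price":        [100, 120, 300, 310, 290, 200],
})

# Count encoding: replace each category by its frequency in the data
df["neigh_count"] = df.groupby("neighborhood")["neighborhood"].transform("count")

# Target encoding: replace each category by the mean target within that category
df["neigh_target"] = df.groupby("neighborhood")["price"].transform("mean")

print(df)
```

Note that naive target encoding like this leaks the target into the features; in practice you would compute the per-category means on the training fold only (or use smoothing/cross-fold schemes).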

There are still other useful encoding schemes, such as CatBoost encoding and weight of evidence. A really nice thing is that all of these schemes are already implemented in the category_encoders library.