14
votes

Here is my question; I hope someone can help me figure it out.

To explain: there are more than 10 categorical columns in my data set, and each of them has 200-300 categories. I want to convert them into binary (one-hot) values. For that I first used LabelEncoder to convert the string categories into numbers. The LabelEncoder code and its output are shown below.

https://i.stack.imgur.com/MIVHV.png

After the LabelEncoder, I used OneHotEncoder from scikit-learn and it worked. BUT THE PROBLEM IS, I need the column names after one-hot encoding. For example, take a column A with categorical values before encoding, A = [1,2,3,4,..]

After encoding, the columns should look like this:

A-1, A-2, A-3

Does anyone know how to assign column names (old column name plus value name or number) after one-hot encoding? Here is my one-hot encoding code and its output:

https://i.stack.imgur.com/kgrNa.png

I need named columns because I trained an ANN, and every time new data comes in I cannot re-encode all the past data again and again. So I want to add just the new columns each time. Thanks anyway.
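For reference, since my actual code is only in the images above, here is a rough sketch of the kind of setup I mean; the column name and values are just placeholders, not my real data:

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Placeholder column standing in for one of the 10+ categorical columns
df = pd.DataFrame({'A': ['red', 'green', 'blue', 'green']})

# Step 1: LabelEncoder turns string categories into integers
df['A'] = LabelEncoder().fit_transform(df['A'])

# Step 2: OneHotEncoder expands the integer column into binary columns
encoded = OneHotEncoder(sparse=False).fit_transform(df[['A']])

# The result is a plain NumPy array with no column names,
# which is exactly the problem
print(encoded)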

3
Please, DO NOT use images of code. Copy the actual text from your code editor, paste it into the question, then format it as code. This helps others more easily read and test your code.

3 Answers

16
votes

You can get the column names using the .get_feature_names() method.

>>> ohenc.get_feature_names()
>>> x_cat_df.columns = ohenc.get_feature_names()

A detailed example is here.
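For instance, a minimal self-contained sketch; ohenc, x_cat and x_cat_df are illustrative names matching the snippet above, and the data is made up:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative categorical data; replace with your own columns
x_cat = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': ['S', 'M', 'S']})

ohenc = OneHotEncoder(sparse=False)
x_cat_df = pd.DataFrame(ohenc.fit_transform(x_cat))

# Passing the original column names gives labels like 'color_blue', 'size_M'
x_cat_df.columns = ohenc.get_feature_names(['color', 'size'])
print(x_cat_df.columns.tolist())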

11
votes

This example could help future readers:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_X = pd.DataFrame({'Sex':['male', 'female']*3, 'AgeGroup':[0,15,30,45,60,75]})
>>>
     Sex     AgeGroup
0    male         0
1  female        15
2    male        30
3  female        45
4    male        60
5  female        75
encoder = OneHotEncoder(sparse=False)

# Encode only the 'Sex' column and wrap the result in a DataFrame
train_X_encoded = pd.DataFrame(encoder.fit_transform(train_X[['Sex']]))

# Name the new columns after the original column and its categories
train_X_encoded.columns = encoder.get_feature_names(['Sex'])

# Drop the original column and concatenate the encoded columns
train_X.drop(['Sex'], axis=1, inplace=True)

OH_X_train = pd.concat([train_X, train_X_encoded], axis=1)
>>>
    AgeGroup  Sex_female  Sex_male
0         0         0.0       1.0
1        15         1.0       0.0
2        30         0.0       1.0
3        45         1.0       0.0
4        60         0.0       1.0
5        75         1.0       0.0
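Note that on scikit-learn 1.0 and later get_feature_names() is replaced by get_feature_names_out(), and from 1.2 the sparse= argument is renamed to sparse_output=, so on a recent install the same idea looks roughly like this:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_X = pd.DataFrame({'Sex': ['male', 'female'] * 3})

encoder = OneHotEncoder(sparse_output=False)                       # was sparse=False
train_X_encoded = pd.DataFrame(encoder.fit_transform(train_X[['Sex']]))
train_X_encoded.columns = encoder.get_feature_names_out(['Sex'])   # was get_feature_names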
1
votes

Hey, I had the same problem, whereby I had a custom estimator which extended the BaseEstimator class from sklearn.base.

I added an attribute in __init__ called self.feature_names, and then, as a last step in the transform method, updated self.feature_names with the columns from the result.

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class CustomOneHotEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, **kwargs):
        # Holds the column names produced by the most recent transform call
        self.feature_names = []

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One-hot encode with pandas, then remember the resulting column names
        result = pd.get_dummies(X)
        self.feature_names = result.columns

        return result

A bit basic, I know, but it does the job I need it to do.
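For example, here is a rough sketch of how this transformer could sit in a pipeline; the data, classifier and parameter grid are only placeholders, and the step names 'ohc' and 'clf' are chosen to match the snippet below:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Placeholder data; use your own features and target
train_X = pd.DataFrame({'Sex': ['male', 'female'] * 10,
                        'AgeGroup': list(range(0, 300, 15))})
train_y = [0, 1] * 10

pipeline = Pipeline([
    ('ohc', CustomOneHotEncoder()),
    ('clf', RandomForestClassifier()),
])

model = GridSearchCV(pipeline, param_grid={'clf__n_estimators': [50, 100]}, cv=2)
model.fit(train_X, train_y)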

If you want to retrieve the column names for the feature importances from your sklearn pipeline, you can get the importances from the classifier step and the column names from the one-hot encoding step.

# Feature importances from the classifier step and the matching column names
a = model.best_estimator_.named_steps["clf"].feature_importances_
b = model.best_estimator_.named_steps["ohc"].feature_names

# Rank the 20 most important features by importance
df = pd.DataFrame(a, b)
df.sort_values(by=[0], ascending=False).head(20)