0
votes

I have a banking_dataframe with 21 columns: one is the target, 10 are numeric features, and 10 are categorical features. I used the get_dummies method of pandas to one-hot encode the categorical data. The returned dataframe has 74 columns. Now I want to merge the encoded dataframe with the original dataframe, so that my final data has one-hot encoded values for the categorical columns but keeps the original size of the dataframe, i.e. 21 columns.

Link to get_dummies function of Pandas:

Code snippet that calls get_dummies on the categorical features:

encoded_features = pd.get_dummies(banking_dataframe[categorical_feature_names])
pd.concat with axis=1? - Quang Hoang
banking_dataframe.join(pd.get_dummies(banking_dataframe[categorical_feature_names]))? - political scientist
I tried both the "pd.concat" and "join" strategies, and the results are the same in both cases. To explain further: the original dataframe had shape (41188, 21); after encoding and concatenating, the shape is (41188, 74), so the number of columns has increased. Don't we need to bring it back to the original size after encoding? Should I pass the data with the new dimensions to my model? - Fariha Abbasi

2 Answers

1
votes
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# creating a toy data frame to test
df = pd.DataFrame({'Gender': ['M', 'F', 'M', 'M', 'F', 'F', 'F']})

# instantiating and transforming the 'Gender' column of the df
one_hot = OneHotEncoder()
encoded = one_hot.fit_transform(df[['Gender']])

# one_hot object has an attribute 'categories_', which stores the array
# of categories sequentially, and those categories can serve as 
# new columns in our data frame.

df[one_hot.categories_[0]] = encoded.toarray()
0
votes

You can try this:

pd.concat([df,encoded_features],axis=1)
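
A minimal sketch of this approach, using a small made-up frame in place of banking_dataframe (the column names are assumptions for illustration). It drops the raw categorical columns before concatenating so that the result does not carry both the original and the encoded versions.

```python
import pandas as pd

# Toy stand-in for banking_dataframe; column names are illustrative only.
df = pd.DataFrame({
    "age": [30, 45, 52, 23],
    "job": ["admin", "technician", "admin", "student"],
    "marital": ["single", "married", "married", "single"],
    "y": [0, 1, 0, 1],
})
categorical_feature_names = ["job", "marital"]

# One-hot encode only the categorical columns.
encoded_features = pd.get_dummies(df[categorical_feature_names])

# Drop the raw categorical columns, then put the encoded ones next to the rest.
final_df = pd.concat(
    [df.drop(columns=categorical_feature_names), encoded_features],
    axis=1,
)

print(final_df.shape)
```

The column count grows because one-hot encoding creates one column per category; the wider frame is what would normally be passed to the model.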

If you don't want to increase the number of columns, try label encoding instead of pd.get_dummies(): pd.get_dummies() adds new columns to the dataset, while label encoding replaces the values within the column itself.
Try this:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['Categorical column_name'] = le.fit_transform(df['Categorical column_name'])
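
LabelEncoder works on one column at a time, and the scikit-learn documentation describes it as intended for target labels rather than input features. For several feature columns, one possible alternative is OrdinalEncoder; the sketch below uses a made-up frame and column names. Note that integer codes imply an ordering between categories, which may not suit every model.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame; column names are illustrative only.
df = pd.DataFrame({
    "age": [30, 45, 52, 23],
    "job": ["admin", "technician", "admin", "student"],
    "marital": ["single", "married", "married", "single"],
    "y": [0, 1, 0, 1],
})
categorical_feature_names = ["job", "marital"]

# Encode every categorical column in place; categories are numbered
# in sorted order within each column.
encoder = OrdinalEncoder()
df[categorical_feature_names] = encoder.fit_transform(df[categorical_feature_names])

# The number of columns is unchanged.
print(df.shape)
```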