Label encoding multiple columns with the same category

Question

Consider the following dataframe:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"], ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
df = df.apply(LabelEncoder().fit_transform)
print(df)

It currently outputs:

   a  b  c
0  0  1  0
1  1  0  0

My goal is to make it output something like this by passing in the columns that should share categorical values:

   a  b  c
0  0  1  2
1  1  0  2

unutbu · Accepted Answer · 2018-02-04T22:01:26

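Pass axis=1 to call LabelEncoder().fit_transform once for each row. (By default, df.apply(func) calls func once for each column.) Because the encoder is refit on every row, the codes agree across rows only when each row contains the same set of values, as it does here.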

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"], 
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

encoder = LabelEncoder()

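# result_type="broadcast" (pandas 0.23+) keeps the original shape, index and columns;
# without it, newer pandas versions return a Series of arrays here.
df = df.apply(encoder.fit_transform, axis=1, result_type="broadcast")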
print(df)

yields

   a  b  c
0  1  2  0
1  2  1  0
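
A variation that is not part of the original answer: fit a single LabelEncoder on all values of the chosen columns, then reuse it for each column, so the codes do not depend on every row holding the same values. This is a minimal sketch; the names shared_columns and encoded are hypothetical and chosen only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

# Hypothetical name: the columns that should share one set of codes.
shared_columns = ["a", "b", "c"]

# Fit once on every value from the selected columns (flattened to 1-D).
encoder = LabelEncoder()
encoder.fit(df[shared_columns].to_numpy().ravel())

# Reuse the fitted encoder for each column so the codes stay consistent.
encoded = df.copy()
for col in shared_columns:
    encoded[col] = encoder.transform(df[col].to_numpy())

print(encoded)
```

For this frame it prints the same table as above. The fitted encoder keeps the sorted labels in encoder.classes_, so encoder.inverse_transform can map the codes back to names.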

Alternatively, you could make the data of category dtype and use the category codes as labels:

import pandas as pd

df = pd.DataFrame(data=[["France", "Italy", "Belgium"], 
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

stacked = df.stack().astype('category')
result = stacked.cat.codes.unstack()
print(result)

also yields

   a  b  c
0  1  2  0
1  2  1  0

This should be significantly faster since it does not require calling encoder.fit_transform once for each row (which might give terrible performance if you have lots of rows).
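
To encode only some columns, as the question asks, one option is pd.factorize, which stays vectorized like the category-dtype approach. The sketch below is an illustration, not part of the original answer, and the helper name encode_shared is hypothetical. It factorizes all values of the selected columns in one call and leaves the other columns untouched.

```python
import pandas as pd


def encode_shared(df, columns):
    """Return a copy of df in which `columns` share one set of integer codes.

    Hypothetical helper for illustration; also returns the sorted unique labels.
    """
    values = df[columns].to_numpy()
    # Factorize all selected values at once; sort=True numbers the labels
    # in sorted order, as the category codes do for these strings.
    codes, uniques = pd.factorize(values.ravel(), sort=True)
    out = df.copy()
    out[columns] = codes.reshape(values.shape)
    return out, uniques


df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])

# Only "a" and "b" share codes; "c" keeps its original strings.
result, uniques = encode_shared(df, ["a", "b"])
print(result)
```

Here columns a and b receive 0 for France and 1 for Italy, and uniques[code] recovers the label for a given code.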
