NaN giving ValueError in OneHotEncoder in scikit-learn

Question

Here is my code

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
        'users':['John Johnson','John Smith','Mary Williams']
})
test = pd.DataFrame({
        'users':[None,np.nan,'John Smith','Mary Williams']
})

ohe = OneHotEncoder(sparse=False,handle_unknown='ignore')
ohe.fit(train)
train_transformed = ohe.fit_transform(train)

test_transformed = ohe.transform(test)
print(test_transformed)

I expected the OneHotEncoder to be able to handle the np.nan in the test dataset, since

handle_unknown='ignore'

But it raises a ValueError. It is able to handle the None value, though. Why is it failing? And how do I get around it (besides Imputer)?

From the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) it seemed that this was what handle_unknown is for.

Amine Benatmane · Accepted Answer · 2019-10-03T14:06:08

You must impute missing values first. handle_unknown='ignore' does not apply to NaN values; it only covers new categories that were not seen when ohe was fitted.

You can treat NaN as a distinct category as follows:

train = train.fillna("NaN")
test = test.fillna("NaN")
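For completeness, here is a minimal end-to-end sketch of that approach applied to the data from the question. The names MISSING, train_filled and test_filled are only illustrative, and the sparse argument is left out and replaced by .toarray() on the default sparse output, on the assumption that this works on both older and newer scikit-learn releases.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'users': ['John Johnson', 'John Smith', 'Mary Williams']})
test = pd.DataFrame({'users': [None, np.nan, 'John Smith', 'Mary Williams']})

# Illustrative placeholder string; any value that cannot clash with a real user works.
MISSING = "NaN"

# Replace both None and np.nan with the placeholder so the encoder only sees strings.
train_filled = train.fillna(MISSING)
test_filled = test.fillna(MISSING)

# The default output is a sparse matrix; .toarray() converts it to a dense array.
ohe = OneHotEncoder(handle_unknown='ignore')
train_transformed = ohe.fit_transform(train_filled).toarray()
test_transformed = ohe.transform(test_filled).toarray()

print(ohe.categories_)
print(test_transformed)
```

In this particular data the placeholder never appears in train, so it is an unknown category at transform time and handle_unknown='ignore' encodes those two test rows as all zeros. If train also contained missing values, the placeholder would get its own column instead.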
