I think you're looking for pandas.get_dummies.
See the following example:
import pandas as pd
df = pd.DataFrame({'col_a': ['cat', 'dog', 'cat', 'mouse', 'mouse', 'cat'], 'col_b': [10, 14, 16, 18, 20, 22], 'col_c': ['a', 'a', 'a', 'b', 'b', 'a']})
# `drop_first=True` drops the first level of each encoded column (here col_a_cat and col_c_a), since it is implied by the remaining dummies
df = pd.get_dummies(df, columns=['col_a', 'col_c'], drop_first=True)
print(df)
Output (on pandas 2.0 and later the dummy columns print as True/False unless you pass dtype=int):
   col_b  col_a_dog  col_a_mouse  col_c_b
0     10          0            0        0
1     14          1            0        0
2     16          0            0        0
3     18          0            1        1
4     20          0            1        1
5     22          0            0        0
This covers the first two conditions you mentioned.
For the third condition, you can do the following:
- create the dummies on the training data
dummy_train = pd.get_dummies(train)
- create the dummies on the new (unseen) data
dummy_new = pd.get_dummies(new_data)
- re-index the new data to the columns of the training data, filling the missing columns with 0 (reindex returns a new DataFrame, so assign the result)
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
Effectively, any category that appears only in the new data is dropped and never reaches the classifier, but I think that should not cause problems, as the classifier would not know what to do with it anyway.
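The three steps above can be sketched end to end as follows. This is only a minimal illustration: the `train` and `new_data` frames, their column names, and their values are made up for the example, and `dtype=int` is passed just so the result prints as 0/1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical training and unseen data; names and values are illustrative only.
train = pd.DataFrame({"animal": ["cat", "dog", "mouse"], "weight": [10, 14, 18]})
new_data = pd.DataFrame({"animal": ["dog", "horse"], "weight": [12, 30]})

# Encode each frame independently.
dummy_train = pd.get_dummies(train, columns=["animal"], dtype=int)
dummy_new = pd.get_dummies(new_data, columns=["animal"], dtype=int)

# Align the new data to the training columns: the category unseen in training
# ("horse") is dropped, and training categories absent here are filled with 0.
dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)

print(dummy_new)
```

After the reindex, `dummy_new` has exactly the same columns, in the same order, as `dummy_train`, so it can be passed to a model fitted on the training dummies.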
pandas or sklearn. However, with a little coding, you can wrap OneHotEncoder to do what you want. – gmds