Dask one-hot-encoding without knowing the categories

Question

I have pandas code that does one-hot encoding as follows:

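import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
...
# Learn the label set and build a 0/1 indicator matrix, one column per category
mlb = MultiLabelBinarizer()
df_tmp = pd.DataFrame(mlb.fit_transform(df['CatData']), columns=mlb.classes_, index=df.index)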

where each entry of my CatData column is a list of categories.

To deal with larger datasets, I am trying to use Dask. There is a straightforward replacement for most pandas functions. However, the one-hot encoding is tricky because the categories are not known in advance. I am thinking of scanning that column row by row across the entire dataset and collecting every category found in the lists into a dictionary, then using that dictionary to create the column names for the one-hot encoding. Is there a more robust way to do this in Dask?
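The two-pass idea described in the question can be sketched in plain Python, independent of Dask. This is a minimal sketch under stated assumptions: the `partitions` list is a hypothetical stand-in for chunks of the CatData column, and `collect_categories` and `encode_partition` are illustrative helper names, not library functions.

```python
from typing import Iterable, List, Sequence


def collect_categories(partitions: Iterable[Sequence[Sequence[str]]]) -> List[str]:
    """Pass 1: union the labels seen in every chunk, sorted for a stable column order."""
    seen = set()
    for chunk in partitions:
        for labels in chunk:
            seen.update(labels)
    return sorted(seen)


def encode_partition(chunk: Sequence[Sequence[str]], categories: List[str]) -> List[List[int]]:
    """Pass 2: one row of 0/1 flags per input row, using the fixed column order."""
    position = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for labels in chunk:
        row = [0] * len(categories)
        for label in labels:
            row[position[label]] = 1
        rows.append(row)
    return rows


# Hypothetical data: two chunks of a column whose entries are lists of labels.
partitions = [
    [["a", "b"], ["b"]],
    [["c", "a"], []],
]
categories = collect_categories(partitions)
encoded = [encode_partition(chunk, categories) for chunk in partitions]
```

Because every chunk is encoded against the same sorted category list, the column layout is identical across chunks, which is the property a partitioned dataframe needs.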

MRocklin · Accepted Answer · 2019-09-21T00:29:58


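You probably want the df.categorize() function. It scans the data to discover the categories in each column and converts those columns to categoricals with known categories.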
