I have pandas code that does the following for one-hot encoding:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
...
mlb = MultiLabelBinarizer()
# One indicator column per category found in the lists of df['CatData']
df_tmp = pd.DataFrame(mlb.fit_transform(df['CatData']), columns=mlb.classes_, index=df.index)

where each entry of my CatData column is a list of categories.

To deal with larger datasets, I am trying to use dask. Most pandas functions have a straightforward replacement, but one-hot encoding is tricky because the categories are not known in advance. I am thinking of scanning that column row by row across the entire dataset and collecting every category found in the lists into a dictionary, then using that dictionary to create the column names for the one-hot encoding. Is there a more robust way to do this in dask?

1 Answer

You probably want the df.categorize() function, which converts columns to categorical dtype so that the categories are known up front.