dask dataframe drop duplicate index values

Question

I am using a dask dataframe with Python 2.7 and want to drop duplicated index values from my df.

When using pandas I would use

df = df[~df.index.duplicated(keep="first")]

And it works
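For reference, here is a small sketch of what that pandas idiom does. The frame and its values are made up for illustration and are not my real data:

```python
import pandas as pd

# Toy frame with a repeated index label "a" (illustrative data only)
df = pd.DataFrame({"value": [1, 2, 3, 4]}, index=["a", "b", "a", "c"])

# duplicated(keep="first") flags every repeat after the first occurrence
mask = df.index.duplicated(keep="first")

# Negating the mask keeps only the first row for each index label
deduped = df[~mask]
```

Here the second "a" row is the only one flagged, so it is the only row removed.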

When trying to do the same with a dask dataframe I get

AttributeError: 'Index' object has no attribute 'duplicated'

I could reset the index and then use the column that was the index to drop the duplicates, but I would like to avoid that if possible.

I could use df.compute() and then drop the duplicated index values, but this df is too big for memory.

How can I drop the duplicated index values from my dataframe using dask dataframe?

jezrael · Accepted Answer · 2017-11-28T14:35:16

I think you need to convert the index to a Series with to_series. keep='first' can be omitted, because it is the default parameter of duplicated:

df = df[~df.index.to_series().duplicated()]
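As a rough check of the claim about the default, the sketch below compares the two masks in plain pandas. It also wraps the idiom in a helper, `drop_duplicate_index`, which is a hypothetical name introduced here for illustration. If `duplicated` turns out not to be available on the dask Series in your version, one possible alternative (an assumption, not part of the answer above) is to apply such a helper per partition with `map_partitions`:

```python
import pandas as pd

def drop_duplicate_index(pdf):
    """Keep the first row for each index label in one pandas DataFrame."""
    # to_numpy() turns the mask into a plain boolean array, so no label
    # alignment is involved when indexing
    return pdf[~pdf.index.to_series().duplicated().to_numpy()]

# Illustrative data only: index label 1 appears twice
pdf = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 1, 2])

# keep="first" is the default, so both masks should agree
explicit = pdf.index.duplicated(keep="first")
via_series = pdf.index.to_series().duplicated().to_numpy()

result = drop_duplicate_index(pdf)

# Hypothetical dask usage, assuming ddf is a dask DataFrame:
# ddf = ddf.map_partitions(drop_duplicate_index)
```

Note that a per-partition approach only removes duplicates that fall inside the same partition. Repeated labels that span partitions would survive, so it is likely only suitable when the index is sorted in a way that keeps equal labels together.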