Create unique ids for rows in dask

Question

I need to add an id to the rows of a dask dataframe. The first thing I tried was adding a cumulative index, as shown in this other question:

df["idx"] = 1
df["idx"] = df["idx"].cumsum()

But my laptop crashed, so maybe a random unique id is an option here.

As additional information, the file I'm using is 10 GB in Parquet format and 20 GB in CSV, and my laptop has 16 GB of RAM.

The other option, which I don't know is possible, is to append the new column to the file without loading it into memory.

MRocklin · Accepted Answer · 2020-01-02T17:45:30

I would figure out some code that does this for Pandas, and then use the map_partitions method to apply the same function in parallel. Maybe something like the following?

import pandas

def add_unique_id_column(df: pandas.DataFrame) -> pandas.DataFrame:
    ...  # build the id column for a single pandas partition here

df = df.map_partitions(add_unique_id_column)
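
As one possible way to fill in that function, here is a minimal sketch that follows the "random unique id" idea from the question. It is an assumption, not the answer's own implementation: it uses the standard library's `uuid.uuid4` to generate one id per row, stores the ids as hex strings, and reuses the column name `idx` from the question. Because each id is random, partitions do not need to know anything about each other.

```python
import uuid

import pandas as pd


def add_unique_id_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of one pandas partition with a random id per row."""
    # uuid4 yields random 128-bit ids, so no coordination between
    # partitions is needed. Hex-string ids and the column name "idx"
    # are illustrative assumptions, not part of the original answer.
    ids = [uuid.uuid4().hex for _ in range(len(df))]
    # assign() returns a new frame and leaves the input partition untouched.
    return df.assign(idx=ids)
```

Passed to `map_partitions` as in the call above, this should run partition by partition instead of computing a cumulative sum across the whole dataset. The ids are unique with overwhelming probability but are not sequential, so this only fits if row order does not need to be encoded in the id.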
