I need to add a column containing auto-increment IDs to my Dask dataframe. I know how to do this in Pandas, since I found a Pandas solution on SO, but I cannot figure out how to do it in Dask. My best attempt is below. The autoincrement function runs only twice for my 100-line test file, and all of the IDs are 2.

def autoincrement(self):
    print('*')
    self.report_line = self.report_line + 1
    return self.report_line

self.df = self.df.map_partitions(
    lambda df: df.assign(raw_report_line=self.autoincrement())
)

The Pandas way looks something like this:

df.insert(0, 'New_ID', range(1, 1 + len(df)))

Alternatively, it would be great if I could fetch the row number of each CSV row and add that as a column, but at this stage that does not seem easily possible.

1 Answer

You can assign a dummy column of all 1s and take its cumulative sum (cumsum), which Dask carries across partition boundaries:

In [1]: import dask.datasets

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: df = dask.datasets.timeseries()

In [5]: df
Out[5]:
Dask DataFrame Structure:
                   id    name        x        y
npartitions=30
2000-01-01      int64  object  float64  float64
2000-01-02        ...     ...      ...      ...
...               ...     ...      ...      ...
2000-01-30        ...     ...      ...      ...
2000-01-31        ...     ...      ...      ...
Dask Name: make-timeseries, 30 tasks

In [6]: df['row_number'] = df.assign(partition_count=1).partition_count.cumsum()

In [7]: df.compute()
Out[7]:
                       id      name         x         y  row_number
timestamp
2000-01-01 00:00:00   928     Sarah -0.597784  0.160908           1
2000-01-01 00:00:01  1000     Zelda -0.034756 -0.073912           2
2000-01-01 00:00:02  1028  Patricia -0.962331 -0.458834           3
2000-01-01 00:00:03  1010    Hannah -0.225759 -0.227945           4
2000-01-01 00:00:04   958   Charlie  0.223131 -0.672307           5
...                   ...       ...       ...       ...         ...
2000-01-30 23:59:55  1052     Jerry -0.636159  0.683076     2591996
2000-01-30 23:59:56   973     Quinn -0.575324  0.272144     2591997
2000-01-30 23:59:57  1049     Jerry  0.143286 -0.122490     2591998
2000-01-30 23:59:58   971    Victor -0.866174  0.751534     2591999
2000-01-30 23:59:59   966     Edith -0.718382 -0.333261     2592000

[2592000 rows x 5 columns]