0
votes

I have dataframe like this:

   STOREID VARIANT_ARTICLE   PO_DATE  UNITSUM
0    st123           12345  20200427      9.0
1    st123           12345  20200428      3.0
2    st123           12345  20200429     13.0
3    st123           12345  20200430      7.0
4    st123           12345  20200501     16.0
5    st123           12345  20200502      3.0
6    st123           12345  20200503      5.0
7    st123           12345  20200504     10.0
8    st123           12345  20200505      3.0
9    st123           12345  20200506      7.0
10   st123           12345  20200507     29.0
11   st123           12345  20200508      4.0
12   st123           12345  20200509      9.0
13   st123           12345  20200510      8.0
14   st123           12345  20200511      5.0
15   st123           12345  20200513      8.0
16   st123           12345  20200514      2.0
17   st123           12345  20200515      2.0
18   st123           12345  20200516      2.0

I want to calculate rolling sum and avg of the UNITSUM column. The catch here is that I need to calculate it for past 4 days (for example) and not previous 4 records which essentially means, for the 15th row in my example, the date range to aggregate for is 20200510 - 20200513. Since there is no entry for 20200512, we do the aggregation on the 3 available rows and don't include 20200509 in our calculation (like pandas does in rolling function).

Is there a way to achieve that?

EDIT: I have to achieve this using dask-dataframe API.

1

1 Answers

1
votes

The dask dataframe has the same syntax to the pandas API:

In [38]: ddf = dask.datasets.timeseries()

In [39]: ddf.head()
Out[39]:
                       id    name         x         y
timestamp
2000-01-01 00:00:00  1003  George -0.287285  0.773949
2000-01-01 00:00:01   992  Oliver -0.738190  0.893916
2000-01-01 00:00:02   972   Jerry  0.080410 -0.972037
2000-01-01 00:00:03   970  George -0.402327  0.034718
2000-01-01 00:00:04  1034   Alice -0.694517  0.646178

In [40]: ddf.x.rolling(4).agg({'sum': 'sum', 'mean': 'mean'}).head()
Out[40]:
                          sum      mean
timestamp
2000-01-01 00:00:00       NaN       NaN
2000-01-01 00:00:01       NaN       NaN
2000-01-01 00:00:02       NaN       NaN
2000-01-01 00:00:03 -1.347393 -0.336848
2000-01-01 00:00:04 -1.754625 -0.438656