4
votes

I would like to calculate a trailing or moving average while preforming a groupby. In the example below I have illustrated a 5 day moving average.

My data looks something like:

ID  Date         KG      Volume
1   8/10/2018    1,000   10 
1   8/11/2018    500     6 
1   8/12/2018    750     2 
1   8/13/2018    500     6 
1   8/13/2018    500     6 
1   8/14/2018    1,000   4 
2   8/1/2018     1304    8
2   8/2/2018     626     1
2   8/3/2018     955     3
2   8/8/2018     445     4
2   8/10/2018    1356    11
2   8/13/2018    524     7
2   8/14/2018    331     5

I would like to group by the ID and Date column, then calculate a trailing 5 day average (from today's date, which would be 8/10-8/14) of the KG/Volume columns. The reason I need the group by on ID and Date, is cases like ID 1, where the date 8/13/18 appears twice. In this example, I would like for ID 1 on 8/13/18 to have one entry of 1,000 KG for the average calculation (and not two entries of 500).

The expected output would be something like:

ID  Avg_KG  Avg_Vol
1   850     7 
2   442.2   4.6

Also please note ID 2 is missing 2 of the 5 days needed for the average calculation. The two missing days for ID 2, should be factored in as 0 for the average calculation. That is why the KG average is 442.2 and not 737 for ID 2.

I have tried using .rolling(5).mean(), but have been getting incorrect results when combining this with a groupby statement.

1

1 Answers

1
votes

First, some setup:

df.Date = pd.to_datetime(df.Date)
df.KG = df.KG.str.replace(',', '').astype(int)

Now groupby Date and ID and sum duplicate days:

s = df.groupby(['Date', 'ID']).agg({'KG': 'sum', 'Volume': 'sum'}).reset_index(1)

            ID    KG  Volume
Date
2018-08-01   2  1304       8
2018-08-02   2   626       1
2018-08-03   2   955       3
2018-08-08   2   445       4
2018-08-10   1  1000      10
2018-08-10   2  1356      11
2018-08-11   1   500       6
2018-08-12   1   750       2
2018-08-13   1  1000      12
2018-08-13   2   524       7
2018-08-14   1  1000       4
2018-08-14   2   331       5

Next, to resample to account for "missing" days:

out = s.groupby('ID').resample('1D').asfreq().drop('ID', 1).reset_index(0).fillna(0)

Finally, index your DataFrame for the past 5 days, and use mean:

out.groupby('ID').tail(5).groupby('ID').mean()

        KG  Volume
ID
1.0  850.0     6.8
2.0  442.2     4.6