4 votes

Say I have the following values:

                                   money_spent
time                 
2014-10-06 17:59:40.016000-04:00      1.832128
2014-10-06 17:59:41.771000-04:00      2.671048
2014-10-06 17:59:43.001000-04:00      2.019434
2014-10-06 17:59:44.792000-04:00      1.294051
2014-10-06 17:59:48.741000-04:00      0.867856
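
For reference, the frame can be reconstructed like this (a minimal sketch; the timestamps and values are copied verbatim from the table above):

import pandas as pd

times = pd.to_datetime([
    '2014-10-06 17:59:40.016-04:00',
    '2014-10-06 17:59:41.771-04:00',
    '2014-10-06 17:59:43.001-04:00',
    '2014-10-06 17:59:44.792-04:00',
    '2014-10-06 17:59:48.741-04:00',
])
df = pd.DataFrame(
    {'money_spent': [1.832128, 2.671048, 2.019434, 1.294051, 0.867856]},
    index=pd.DatetimeIndex(times, name='time'),
)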

I am hoping to measure how much money is spent every 2 seconds. More specifically, for every timestamp in the output, I need to see the money spent within the last 2 seconds.

When I do:

df.resample('2S', how='last')

I get:

                                money_spent
time               
2014-10-06 17:59:40-04:00          2.671048
2014-10-06 17:59:42-04:00          2.019434
2014-10-06 17:59:44-04:00          1.294051
2014-10-06 17:59:46-04:00               NaN
2014-10-06 17:59:48-04:00          0.867856

which is not what I would expect. To start with, note that the first entry in the resampled df is 2.671048, at time 17:59:40, even though, according to the original dataframe, no money had been spent yet at that time. How is that possible?


3 Answers

5 votes

Try using how=np.sum, so that each 2-second bin sums the amounts that fall inside it:

import numpy as np

df.resample('2S', how=np.sum, closed='left', label='right')
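
The how= keyword was deprecated in pandas 0.18 and later removed; in current versions the same resample is written as a method call. A minimal sketch, assuming the df from the question:

df.resample('2S', closed='left', label='right').sum()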

Edit:

As for closed and label:

It means each bin is a left-closed, 2-second interval, e.g. [1, 2) contains values like 1, 1.2, 1.5, 1.9 but not 2, and each bin is labeled with the date from the right end of its interval. And from the docs:

closed : {‘right’, ‘left’} Which side of bin interval is closed

label : {‘right’, ‘left’} Which bin edge label to label bucket with
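
With the sample data from the question, those settings put the first two rows into the left-closed bin [17:59:40, 17:59:42), which is labeled with its right edge, 17:59:42. Summing by hand, the output should look roughly like this (the empty [17:59:46, 17:59:48) bin comes out as 0.0 in recent pandas, or NaN in some older versions):

                            money_spent
time
2014-10-06 17:59:42-04:00      4.503176
2014-10-06 17:59:44-04:00      2.019434
2014-10-06 17:59:46-04:00      1.294051
2014-10-06 17:59:48-04:00      0.000000
2014-10-06 17:59:50-04:00      0.867856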

1 vote

You can add offsetting events to the frame, so that each dollar spent "leaves" the system two seconds after it arrives; then all you need is a cumulative sum.

There is a chance that an original event and a shifted exit event land on the same timestamp; in that case, after the very last step, you need to de-duplicate the time index, keeping the last value of money_spent for each duplicated timestamp (see the sketch at the end of this answer):

>>> df
                            money_spent
time                                   
2014-10-06 21:59:40.016000        1.832
2014-10-06 21:59:41.771000        2.671
2014-10-06 21:59:43.001000        2.019
2014-10-06 21:59:44.792000        1.294
2014-10-06 21:59:48.741000        0.868

>>> import numpy as np  # for the timedelta shift below
>>> xdf = df.copy()   # make a copy of the original frame
>>> xdf['money_spent'] *= -1  # negate the value of `money_spent`
>>> xdf.index += np.timedelta64(2, 's')  # shift the exit events 2 seconds later

Now concat with the original frame, sort the index, and take the cumulative sum:

>>> pd.concat([df, xdf]).sort_index().cumsum()
                            money_spent
2014-10-06 21:59:40.016000    1.832e+00
2014-10-06 21:59:41.771000    4.503e+00
2014-10-06 21:59:42.016000    2.671e+00
2014-10-06 21:59:43.001000    4.690e+00
2014-10-06 21:59:43.771000    2.019e+00
2014-10-06 21:59:44.792000    3.313e+00
2014-10-06 21:59:45.001000    1.294e+00
2014-10-06 21:59:46.792000   -4.441e-16
2014-10-06 21:59:48.741000    8.679e-01
2014-10-06 21:59:50.741000   -4.441e-16

Floating-point precision error shows up as very small values like -4.441e-16; otherwise the numbers look correct to me:

>>> _['money_spent'].round(15)
2014-10-06 21:59:40.016000    1.832
2014-10-06 21:59:41.771000    4.503
2014-10-06 21:59:42.016000    2.671
2014-10-06 21:59:43.001000    4.690
2014-10-06 21:59:43.771000    2.019
2014-10-06 21:59:44.792000    3.313
2014-10-06 21:59:45.001000    1.294
2014-10-06 21:59:46.792000   -0.000
2014-10-06 21:59:48.741000    0.868
2014-10-06 21:59:50.741000   -0.000
Name: money_spent, dtype: float64
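
As for the de-duplication mentioned at the start of this answer: in recent pandas versions you can keep only the last row per duplicated timestamp with index.duplicated. A minimal sketch, continuing from the frames above:

>>> result = pd.concat([df, xdf]).sort_index().cumsum()
>>> result = result[~result.index.duplicated(keep='last')]  # last money_spent per timestamp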
0 votes

The reason the first element in your resampled frame is 2.671048 is that you're using how='last'. With the default closed='left', label='left' for a 2-second frequency, the 17:59:40 row is the bin [17:59:40, 17:59:42), and its last observation is the one at 17:59:41.771. If you want the first resampled point to read 1.832128 instead, use the how='first' kwarg.
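
In current pandas versions, where how= is gone, the same two calls are written as methods. A minimal sketch, assuming the df from the question:

df.resample('2S').last()   # reproduces the output in the question
df.resample('2S').first()  # first observation per 2-second bin instead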