Say I have data like so:
+------+-------+-------+---------------------+
| Col1 | Col2  | Col3  | Col4                |
+------+-------+-------+---------------------+
| A    | 0.532 | 0.234 | 2020-01-01 05:00:00 |
| B    | 0.242 | 0.224 | 2020-01-01 06:00:00 |
| A    | 0.152 | 0.753 | 2020-01-01 08:00:00 |
| C    | 0.149 | 0.983 | 2020-01-01 08:00:00 |
| A    | 0.635 | 0.429 | 2020-01-01 09:00:00 |
| A    | 0.938 | 0.365 | 2020-01-01 10:00:00 |
| A    | 0.294 | 0.234 | 2020-01-01 12:00:00 |
| C    | 0.293 | 0.956 | 2020-01-02 05:00:00 |
| E    | 0.294 | 0.394 | 2020-01-02 07:00:00 |
| D    | 0.294 | 0.258 | 2020-01-02 08:00:00 |
| A    | 0.687 | 0.666 | 2020-01-03 05:00:00 |
| C    | 0.232 | 0.494 | 2020-01-03 06:00:00 |
| D    | 0.575 | 0.845 | 2020-01-03 07:00:00 |
+------+-------+-------+---------------------+
I want to create another column that is:
- The sum of Col2
- Grouped by Col1 and by calendar day (the sum restarts each day, as the example below shows)
- Including only records whose Col4 is at least 2 hours before the current row's Col4
So, for this example, looking at the A records and summing Col2:
+------+-------+-------+---------------------+
| Col1 | Col2  | Col3  | Col4                |
+------+-------+-------+---------------------+
| A    | 0.532 | 0.234 | 2020-01-01 05:00:00 | => null, as it is the earliest A record that day
| A    | 0.152 | 0.753 | 2020-01-01 08:00:00 | => 0.532, as 05:00:00 is >= 2 hours prior
| A    | 0.635 | 0.429 | 2020-01-01 09:00:00 | => 0.532, as 08:00:00 is within 2 hours but 05:00:00 is >= 2 hours prior
| A    | 0.938 | 0.365 | 2020-01-01 10:00:00 | => 0.532 + 0.152, as 09:00:00 is within 2 hours but 08:00:00 and 05:00:00 are >= 2 hours prior
| A    | 0.294 | 0.234 | 2020-01-01 12:00:00 | => 0.532 + 0.152 + 0.635 + 0.938, as every earlier record that day is >= 2 hours prior
| A    | 0.687 | 0.666 | 2020-01-03 05:00:00 | => null, as it is the earliest A record that day
+------+-------+-------+---------------------+
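To make the target completely concrete, here is a brute-force version of the logic I'm after. I'm using pandas purely for illustration; the frame name df, the datetime dtype on Col4, and the output name Col5 are my own placeholders, and the per-day reset from the example is baked in:

```python
import pandas as pd

# df has columns Col1 (group), Col2 (value to sum), Col3 (unused here),
# and Col4 (a datetime64 column).
def lagged_sum(row, frame):
    """Sum Col2 over rows in the same Col1 group, on the same calendar day,
    whose Col4 is at least 2 hours before this row's Col4."""
    mask = (
        (frame["Col1"] == row["Col1"])
        & (frame["Col4"].dt.date == row["Col4"].date())
        & (frame["Col4"] <= row["Col4"] - pd.Timedelta(hours=2))
    )
    # Null when nothing qualifies, matching the expected output above.
    return frame.loc[mask, "Col2"].sum() if mask.any() else None

df["Col5"] = df.apply(lagged_sum, axis=1, frame=df)
```

This reproduces the column above, but it is quadratic in the group size, which is why I'm looking for a sort/window-based approach instead.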
I have thought about sorting the rows and doing a cumulative sum, but I am not sure how to exclude the records that fall inside the 2-hour window.
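In pandas terms (same assumed names as above), the cumulative-sum idea only gets me the running total of everything before each row; subtracting the rows inside the 2-hour window is the part I can't see:

```python
df = df.sort_values(["Col1", "Col4"])
# Running total of all *prior* rows, per Col1 group and calendar day...
df["running"] = (
    df.groupby(["Col1", df["Col4"].dt.date])["Col2"]
      .transform(lambda s: s.cumsum().shift())
)
# ...but this still includes rows less than 2 hours before the current one.
```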
I have also thought about grouping and summing with a condition, but I am not totally sure how to execute that.
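The only conditional version I can come up with is a pairwise self-join (pandas again, with Col5 as a placeholder output name). It matches the expected output on the sample, but it scales quadratically, so I suspect there is a proper window-function way to do this:

```python
import pandas as pd

# Pair every row with every other row in the same Col1 group.
pairs = df.merge(df, on="Col1", suffixes=("", "_prev"))
# Keep pairs on the same day where the partner is >= 2 hours earlier.
qualifies = (
    (pairs["Col4_prev"].dt.date == pairs["Col4"].dt.date)
    & (pairs["Col4_prev"] <= pairs["Col4"] - pd.Timedelta(hours=2))
)
sums = (
    pairs[qualifies]
    .groupby(["Col1", "Col4"])["Col2_prev"].sum()
    .rename("Col5")
    .reset_index()
)
# Rows with no qualifying earlier record come out as NaN, as desired.
df = df.merge(sums, on=["Col1", "Col4"], how="left")
```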
I have also thought about emitting records to fill the gaps so that all hours are present, and taking the sum up to 2 hours prior. However, this would require transforming the data, since it is not inherently clean at the top of each hour; the timestamps are effectively random.
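For completeness, that gap-filling idea looks roughly like the sketch below (pandas; per-day reset omitted for brevity). Flooring Col4 to the hour is exactly the transformation I would rather avoid, since it distorts the 2-hour comparison:

```python
hourly = (
    df.assign(hour=df["Col4"].dt.floor("h"))
      .groupby("Col1")
      .apply(
          lambda g: g.set_index("hour")["Col2"]
                     .resample("1h").sum()  # missing hours become 0
                     .cumsum()              # running total on the hourly grid
                     .shift(2)              # the total as of 2 hours earlier
      )
)
# This still needs to be mapped back onto the original, un-floored rows.
```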