Pyspark: Window / Cumulative Sum with Condition

Question

Say I have data like so:

+------+-------+-------+---------------------+
| Col1 | Col2  | Col3  | Col3                |
+------+-------+-------+---------------------+
| A    | 0.532 | 0.234 | 2020-01-01 05:00:00 |
| B    | 0.242 | 0.224 | 2020-01-01 06:00:00 |
| A    | 0.152 | 0.753 | 2020-01-01 08:00:00 |
| C    | 0.149 | 0.983 | 2020-01-01 08:00:00 |
| A    | 0.635 | 0.429 | 2020-01-01 09:00:00 |
| A    | 0.938 | 0.365 | 2020-01-01 10:00:00 |
| C    | 0.293 | 0.956 | 2020-01-02 05:00:00 |
| A    | 0.294 | 0.234 | 2020-01-02 06:00:00 |
| E    | 0.294 | 0.394 | 2020-01-02 07:00:00 |
| D    | 0.294 | 0.258 | 2020-01-02 08:00:00 |
| A    | 0.687 | 0.666 | 2020-01-03 05:00:00 |
| C    | 0.232 | 0.494 | 2020-01-03 06:00:00 |
| D    | 0.575 | 0.845 | 2020-01-03 07:00:00 |
+------+-------+-------+---------------------+

I want to create another column that is:

Sum of Col2
Grouped by Col1
Only for records outside of 2 hours prior to Col3

So, for this example, looking at A, and summing Col2

+------+-------+-------+---------------------+
| Col1 | Col2  | Col3  | Col3                |
+------+-------+-------+---------------------+
| A    | 0.532 | 0.234 | 2020-01-01 05:00:00 | => Will be null, as it is the earliest
| A    | 0.152 | 0.753 | 2020-01-01 08:00:00 | => 0.532, as 05:00:00 is >= 2 hours prior
| A    | 0.635 | 0.429 | 2020-01-01 09:00:00 | => 0.532, as 08:00:00 is <2 hours, but 05:00:00 is >= 2 hours (08:00 is within 2 hours of 09:00)
| A    | 0.938 | 0.365 | 2020-01-01 10:00:00 | => 0.532 + 0.152, as 09:00:00 is < 2 hours, but 08:00:00 and 05:00:00 are >= 2 hours prior
| A    | 0.294 | 0.234 | 2020-01-01 12:00:00 | => 0.532 + 0.152 + 0.635 + 0.938, as all of the ones on the same day are >= least 2 hours prior.
| A    | 0.687 | 0.666 | 2020-01-03 05:00:00 | => Will be null, as it is the earliest this day.
+------+-------+-------+---------------------+

I have thought about sorting them and doing a cumulative sum, but not sure how to exclude the ones inside of the 2-hour range.
Have thought about grouping and summing with condition, but not totally sure how to execute that.
Have also thought about emitting records to fill the gap such that they are all hours are filled in, and taking sum up until 2 prior. However, this would require me to transform the data as it is not inherently clean at the top of each hour; they are actual random timestamps.

murtihash murtihash · Accepted Answer · 2020-07-27T23:51:35

For Spark2.4+ try this.

from pyspark.sql import functions as F
from pyspark.sql.window import Window


w=Window().partitionBy("col1",F.to_date("col4", "yyyy-MM-dd HH:mm:ss")).orderBy(F.unix_timestamp("col4"))\
           .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df\
  .withColumn("try", F.collect_list("col2").over(w))\
  .withColumn("try2", F.collect_list(F.unix_timestamp("col4")).over(w))\
  .withColumn("col5", F.arrays_zip("try","try2")).drop("try")\
  .withColumn("try3",F.element_at("try2", -1))\
  .withColumn("col5", F.when(F.size("try2")>1,F.expr("""aggregate(filter(col5, x-> x.try2 <= (try3-7200)),\
                                                     cast(0 as double), (acc,y)-> acc+y.try)""")).otherwise(None))\
  .drop("try3","try2").orderBy("col1","col4").show(truncate=False)

#+----+-----+-----+-------------------+------------------+
#|col1|col2 |col3 |col4               |col5              |
#+----+-----+-----+-------------------+------------------+
#|A   |0.532|0.234|2020-01-01 05:00:00|null              |
#|A   |0.152|0.753|2020-01-01 08:00:00|0.532             |
#|A   |0.635|0.429|2020-01-01 09:00:00|0.532             |
#|A   |0.938|0.365|2020-01-01 10:00:00|0.684             |
#|A   |0.294|0.234|2020-01-01 12:00:00|2.2569999999999997|
#|A   |0.687|0.666|2020-01-03 05:00:00|null              |
#|B   |0.242|0.224|2020-01-01 06:00:00|null              |
#|C   |0.149|0.983|2020-01-01 08:00:00|null              |
#|C   |0.293|0.956|2020-01-02 05:00:00|null              |
#|C   |0.232|0.494|2020-01-03 06:00:00|null              |
#|D   |0.294|0.258|2020-01-02 08:00:00|null              |
#|D   |0.575|0.845|2020-01-03 07:00:00|null              |
#|E   |0.294|0.394|2020-01-02 07:00:00|null              |
#+----+-----+-----+-------------------+------------------+

Pyspark: Window / Cumulative Sum with Condition

1 Answers