
I am trying to compute a cumulative sum per class. The code works fine using sum(df.value).over(Window.partitionBy('class').orderBy('time')).

from pyspark.sql import Window
from pyspark.sql.functions import sum

df = sqlContext.createDataFrame(
    [(1, 10, "a"), (3, 2, "a"), (1, 2, "b"), (2, 5, "a"), (2, 1, "b"), (9, 0, "b"),
     (4, 1, "b"), (7, 8, "a"), (3, 8, "b"), (2, 5, "a"), (0, 0, "a"), (4, 3, "a")],
    ["time", "value", "class"]
)

+----+-----+-----+
|time|value|class|
+----+-----+-----+
|   1|   10|    a|
|   3|    2|    a|
|   1|    2|    b|
|   2|    5|    a|
|   2|    1|    b|
|   9|    0|    b|
|   4|    1|    b|
|   7|    8|    a|
|   3|    8|    b|
|   2|    5|    a|
|   0|    0|    a|
|   4|    3|    a|
+----+-----+-----+


df.withColumn('cumsum_value', sum(df.value).over(Window.partitionBy('class').orderBy('time'))).show()


+----+-----+-----+------------+
|time|value|class|cumsum_value|
+----+-----+-----+------------+
|   1|    2|    b|           2|
|   2|    1|    b|           3|
|   3|    8|    b|          11|
|   4|    1|    b|          12|
|   9|    0|    b|          12|
|   0|    0|    a|           0|
|   1|   10|    a|          10|
|   2|    5|    a|          20|
|   2|    5|    a|          20|
|   3|    2|    a|          22|
|   4|    3|    a|          25|
|   7|    8|    a|          33|
+----+-----+-----+------------+

But it's not working with duplicate rows: the two identical (2, 5, a) rows both get a cumulative sum of 20, while the first should get 15. The desired output is:

+----+-----+-----+------------+
|time|value|class|cumsum_value|
+----+-----+-----+------------+
|   1|    2|    b|           2|
|   2|    1|    b|           3|
|   3|    8|    b|          11|
|   4|    1|    b|          12|
|   9|    0|    b|          12|
|   0|    0|    a|           0|
|   1|   10|    a|          10|
|   2|    5|    a|          15|
|   2|    5|    a|          20|
|   3|    2|    a|          22|
|   4|    3|    a|          25|
|   7|    8|    a|          33|
+----+-----+-----+------------+
Seems like you want .orderBy('time', 'value'), i.e. you have to define how to break ties in the case that the times are the same. – pault
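(Note: tie-breaking alone does not fix the exact-duplicate rows. With Spark's default RANGE window frame, rows that tie on all of the orderBy columns are peers and are summed together, so the two (2, 5, a) rows would still both get 20. A quick sketch of this, reusing the df and imports from the question:

w = Window.partitionBy('class').orderBy('time', 'value')
# the two (2, 5, a) rows are still peers under the default RANGE frame,
# so both get cumsum_value = 20
df.withColumn('cumsum_value', sum(df.value).over(w)).show()

This is why the answer below introduces a row_number() helper column.)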

1 Answer


Adding to @pault's comment, I would suggest a row_number() calculation based on orderBy('time', 'value'), and then use that column in the orderBy of a second window (w2) to get your cumulative sum.

This handles both cases: where time and value are both the same (exact duplicates), and where time is the same but value isn't. Because row_number() assigns a distinct number to every row, no two rows are peers in the second window, so each row accumulates separately.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# w1 defines a deterministic order; row_number() over it gives every row a unique rank
w1 = Window.partitionBy("class").orderBy("time", "value")
# w2 orders by that unique rank, so no two rows are window peers
w2 = Window.partitionBy("class").orderBy("rownum")

df.withColumn("rownum", F.row_number().over(w1))\
  .withColumn("cumsum_value", F.sum("value").over(w2))\
  .drop("rownum").show()

+----+-----+-----+------------+
|time|value|class|cumsum_value|
+----+-----+-----+------------+
|   1|    2|    b|           2|
|   2|    1|    b|           3|
|   3|    8|    b|          11|
|   4|    1|    b|          12|
|   9|    0|    b|          12|
|   0|    0|    a|           0|
|   1|   10|    a|          10|
|   2|    5|    a|          15|
|   2|    5|    a|          20|
|   3|    2|    a|          22|
|   4|    3|    a|          25|
|   7|    8|    a|          33|
+----+-----+-----+------------+
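An alternative worth noting (a sketch, not from the original answer; assumes Spark 2.1+ for Window.unboundedPreceding): the root cause is that Spark's default frame for an ordered window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which sums all tied rows together. Requesting an explicit ROWS frame accumulates one row at a time, so the helper column is not needed:

# explicit ROWS frame: each row sees only the rows physically before it,
# not all of its orderBy peers
w = Window.partitionBy("class").orderBy("time", "value")\
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("cumsum_value", F.sum("value").over(w)).show()

For exact duplicates, which of the two rows gets 15 versus 20 is order-dependent, but since the rows are identical the result matches the desired output either way.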