I have a PySpark DataFrame like this:
class classID Property
1 1 1
1 2 0
1 3 1
1 4 1
2 1 0
2 2 0
2 3 1
Now I need to add a column that counts how many rows within the current partition (partitioned by class, ordered by classID), up to and including the current row, have Property == 1. Like this:
class classID Property relevantCount
1 1 1 1
1 2 0 1
1 3 1 2
1 4 1 3
2 1 0 0
2 2 0 0
2 3 1 1
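To pin down the values I expect, here is a minimal plain-Python sketch of the same computation. It is only a reference for the semantics, not a Spark solution; the names `rows` and `running_property_count` are made up for illustration.

```python
from itertools import groupby

# Rows as (class, classID, Property), copied from the example table above.
rows = [
    (1, 1, 1), (1, 2, 0), (1, 3, 1), (1, 4, 1),
    (2, 1, 0), (2, 2, 0), (2, 3, 1),
]

def running_property_count(rows):
    """Per class, append the number of rows so far (current row included) with Property == 1."""
    result = []
    # Sort by class, then classID, so each group is visited in window order.
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for _, group in groupby(ordered, key=lambda r: r[0]):
        count = 0  # the counter restarts for every class
        for cls, class_id, prop in group:
            if prop == 1:
                count += 1
            result.append((cls, class_id, prop, count))
    return result

expected = running_property_count(rows)
for row in expected:
    print(row)
```

The last element of each tuple corresponds to the relevantCount column in the table above.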
For example, I tried a window function:
import pyspark.sql.functions as f
from pyspark.sql.window import Window
windowSpec = Window.partitionBy('class').orderBy(f.col('classID'))
df = df.withColumn(
    'relevantCount',
    f.when((f.col('rank') == f.lit(1)) & (f.col('Property') == f.lit(0)), 0)
     .otherwise(f.col('Property') + f.col(f.lag('deliveryCountDesc').over(windowSpec)))
)
But I cannot reference the new column's values from previous rows.
Does anyone have a better idea?