Let's say we have
|bin | min | end | start |
|1 | 5 | 10 |
|2 | 12 | 24 |
|3 | 28 | 36 |
|4 | 40 | 50 |
|5 | null| null |
I would want to populate start as the previous column's end to make continuous bin values. For the missing I would like to fill in with the current min instead. For null row I consider treating it separately.
What lag gives us would be
df.withColumn("start", F.lag(col("end"), 1, ***default_value***).over(orderBy(col("bin"))
|bin | min | end | start |
|1 | 5 | 10 | (5 wanted)
|2 | 12 | 24 | 10
|3 | 28 | 36 | 24
|4 | 40 | 50 | 36
|5 | null| null | null
My questions :
1/ What do we put in default_value for lag to take another column of current row, in this case min
2/ Is there a way to treat null row at the same time without separating ? I intend to filter non-null , perform lag, then union back with the null rows. How will the answer differ if Null is the first(bin 1) or last (bin 5) ?