
Let's say we have

| bin | min  | end  | start |
|-----|------|------|-------|
| 1   | 5    | 10   |       |
| 2   | 12   | 24   |       |
| 3   | 28   | 36   |       |
| 4   | 40   | 50   |       |
| 5   | null | null |       |

I want to populate start with the previous row's end to make the bin values continuous. For the first row, which has no previous end, I would like to fill in the current row's min instead. The null row I am considering treating separately.
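For reproducibility, here is a minimal sketch that builds this sample DataFrame (column types are assumed from the table above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table; bin 5 is the row with null min/end.
df = spark.createDataFrame(
    [(1, 5, 10), (2, 12, 24), (3, 28, 36), (4, 40, 50), (5, None, None)],
    ["bin", "min", "end"],
)
```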

What lag gives us would be

df.withColumn("start", F.lag(F.col("end"), 1, ***default_value***).over(Window.orderBy(F.col("bin"))))
| bin | min  | end  | start                    |
|-----|------|------|--------------------------|
| 1   | 5    | 10   | default_value (5 wanted) |
| 2   | 12   | 24   | 10                       |
| 3   | 28   | 36   | 24                       |
| 4   | 40   | 50   | 36                       |
| 5   | null | null | 50 (null wanted)         |

My questions:

1/ What do we put in default_value for lag to take another column of the current row, in this case min?

2/ Is there a way to treat the null row at the same time, without separating it out? I intend to filter the non-null rows, perform the lag, then union back with the null rows. How will the answer differ if the null row is the first (bin 1) or the last (bin 5)?


1 Answer


Use coalesce to fall back to a column of the current row when lag returns null, i.e. for the first row in the window.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df.withColumn("start", F.coalesce(F.lag(F.col("end"), 1).over(Window.orderBy("bin")), F.col("min")))
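With the sample data above this fills start with 5, 10, 24 and 36 for bins 1 through 4. Note that bin 5 still gets start = 50, because lag picks up the previous row's end, so if the null row should keep a null start it still needs separate handling.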

lag currently doesn't support an ignorenulls option, so you might have to separate out the null rows, compute the start column for the non-null rows, and union the DataFrames back together.
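A minimal sketch of that filter/compute/union approach, assuming the null row should end up with a null start (the cast to long is assumed to match the inferred type of end):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Apply the window only to rows that actually have an end value.
non_null = df.filter(F.col("end").isNotNull())
null_rows = df.filter(F.col("end").isNull())

w = Window.orderBy("bin")
non_null = non_null.withColumn(
    "start", F.coalesce(F.lag("end", 1).over(w), F.col("min"))
)

# Give the null rows a matching start column, then union back.
result = non_null.unionByName(
    null_rows.withColumn("start", F.lit(None).cast("long"))
)
```

Because the null rows are filtered out before the window is applied, it does not matter whether the null row is the first (bin 1) or the last (bin 5): lag only ever sees the non-null rows. The one difference is that if bin 1 were the null row, bin 2 would become the first non-null row and would take its own min as start.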