I have a dataframe where a value label
is associated with (id, bin, date, hour)
:
+----------+----+-----+---+-------------------+
| date|hour| id|bin| label|
+----------+----+-----+---+-------------------+
|2019_12_20| 8| 1| 0| 151.7050821002368|
|2019_12_20| 8| 1| 2| 101.13672140015788|
|2019_12_20| 8| 1| 3| 16.856120233359647|
...
I want to append multiple columns to this dataframe corresponding to the label
at the same hour on the previous day, an hour ago on the previous day, etc. I know how to get the first with the lag function:
val dateWindow = Window.partitionBy($"id", $"bin").orderBy($"hour", $"date")
val expandedDf = data.withColumn("yesterdaySameHour", lag($"label", 1, 0.0).over(dateWindow))
However, I can't figure out how to get the value label
at hour - 1
on the previous day. Is there a way to have a conditional lag where I can filter out hour
that's larger than or equal to the current row's hour? If not, what's the proper way to do this?
Many thanks.