I have a PySpark DataFrame with the sample rows shown below. I'm trying to get the row with the maximum avg_value within each 10-minute span. I tried using Window functions, but I'm not able to get the result.
Here is my DataFrame with random data covering 30 minutes. I expect 3 rows in the output, one for each 10-minute window.
+-------------------+---------+
| event_time|avg_value|
+-------------------+---------+
|2019-12-29 00:01:00| 9.5|
|2019-12-29 00:02:00| 9.0|
|2019-12-29 00:04:00| 8.0|
|2019-12-29 00:06:00| 21.0|
|2019-12-29 00:08:00| 7.0|
|2019-12-29 00:11:00| 8.5|
|2019-12-29 00:12:00| 11.5|
|2019-12-29 00:14:00| 8.0|
|2019-12-29 00:16:00| 31.0|
|2019-12-29 00:18:00| 8.0|
|2019-12-29 00:21:00| 8.0|
|2019-12-29 00:22:00| 16.5|
|2019-12-29 00:24:00| 7.0|
|2019-12-29 00:26:00| 14.0|
|2019-12-29 00:28:00| 7.0|
+-------------------+---------+
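For reproducibility, here is a minimal sketch that builds the same sample DataFrame. The data is copied from the table above; the SparkSession variable spark and the column types (a string parsed to timestamp, plus a double) are my assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

# Reproducibility sketch: 'spark' is assumed to be an available SparkSession,
# and the resulting DataFrame is the 'data' variable used below.
spark = SparkSession.builder.getOrCreate()

rows = [
    ('2019-12-29 00:01:00', 9.5),  ('2019-12-29 00:02:00', 9.0),
    ('2019-12-29 00:04:00', 8.0),  ('2019-12-29 00:06:00', 21.0),
    ('2019-12-29 00:08:00', 7.0),  ('2019-12-29 00:11:00', 8.5),
    ('2019-12-29 00:12:00', 11.5), ('2019-12-29 00:14:00', 8.0),
    ('2019-12-29 00:16:00', 31.0), ('2019-12-29 00:18:00', 8.0),
    ('2019-12-29 00:21:00', 8.0),  ('2019-12-29 00:22:00', 16.5),
    ('2019-12-29 00:24:00', 7.0),  ('2019-12-29 00:26:00', 14.0),
    ('2019-12-29 00:28:00', 7.0),
]
data = (spark.createDataFrame(rows, ['event_time', 'avg_value'])
             .withColumn('event_time', to_timestamp('event_time')))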
I am using the code below for that:

from pyspark.sql import Window
from pyspark.sql.functions import rank

window_spec = Window.partitionBy('event_time').orderBy('event_time').rangeBetween(-60*10, 0)
new_df = data.withColumn('rank', rank().over(window_spec))
new_df.show()
but this code gives me the following error:
pyspark.sql.utils.AnalysisException: 'Window Frame specifiedwindowframe(RangeFrame, -600, currentrow$()) must match the required frame specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$());'
My desired output is the row with the maximum avg_value from each 10-minute window:
+-------------------+---------+
| event_time|avg_value|
+-------------------+---------+
|2019-12-29 00:06:00| 21.0|
|2019-12-29 00:16:00| 31.0|
|2019-12-29 00:22:00| 16.5|
+-------------------+---------+
Can someone please help me with this?
Thanks in advance.