I have a PySpark DataFrame with the sample rows shown below. I'm trying to get the row with the maximum avg_value within each 10-minute span. I tried using Window functions, but I'm not able to get the result.
Here is my DataFrame with random data covering 30 minutes. I expect 3 rows in the output, one for each 10-minute window.
+-------------------+---------+
| event_time|avg_value|
+-------------------+---------+
|2019-12-29 00:01:00| 9.5|
|2019-12-29 00:02:00| 9.0|
|2019-12-29 00:04:00| 8.0|
|2019-12-29 00:06:00| 21.0|
|2019-12-29 00:08:00| 7.0|
|2019-12-29 00:11:00| 8.5|
|2019-12-29 00:12:00| 11.5|
|2019-12-29 00:14:00| 8.0|
|2019-12-29 00:16:00| 31.0|
|2019-12-29 00:18:00| 8.0|
|2019-12-29 00:21:00| 8.0|
|2019-12-29 00:22:00| 16.5|
|2019-12-29 00:24:00| 7.0|
|2019-12-29 00:26:00| 14.0|
|2019-12-29 00:28:00| 7.0|
+-------------------+---------+
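For reproducibility, here is a minimal sketch that builds the same sample DataFrame. The data is copied from the table above; the SparkSession variable spark and the column types (a string parsed to timestamp, plus a double) are my assumptions.

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp

# Reproducibility sketch: 'spark' is assumed to be an available SparkSession,
# and the resulting DataFrame is the 'data' variable used below.
spark = SparkSession.builder.getOrCreate()

rows = [
    ('2019-12-29 00:01:00', 9.5),  ('2019-12-29 00:02:00', 9.0),
    ('2019-12-29 00:04:00', 8.0),  ('2019-12-29 00:06:00', 21.0),
    ('2019-12-29 00:08:00', 7.0),  ('2019-12-29 00:11:00', 8.5),
    ('2019-12-29 00:12:00', 11.5), ('2019-12-29 00:14:00', 8.0),
    ('2019-12-29 00:16:00', 31.0), ('2019-12-29 00:18:00', 8.0),
    ('2019-12-29 00:21:00', 8.0),  ('2019-12-29 00:22:00', 16.5),
    ('2019-12-29 00:24:00', 7.0),  ('2019-12-29 00:26:00', 14.0),
    ('2019-12-29 00:28:00', 7.0),
]
data = (spark.createDataFrame(rows, ['event_time', 'avg_value'])
             .withColumn('event_time', to_timestamp('event_time')))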
I am using the code below for that:

from pyspark.sql import Window
from pyspark.sql.functions import rank

window_spec = Window.partitionBy('event_time').orderBy('event_time').rangeBetween(-60*10, 0)
new_df = data.withColumn('rank', rank().over(window_spec))
new_df.show()
but this code gives me the following error:
pyspark.sql.utils.AnalysisException: 'Window Frame specifiedwindowframe(RangeFrame, -600, currentrow$()) must match the required frame specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$());'
My desired output is the row with the maximum avg_value from each 10-minute window:
+-------------------+---------+
| event_time|avg_value|
+-------------------+---------+
|2019-12-29 00:06:00| 21.0|
|2019-12-29 00:16:00| 31.0|
|2019-12-29 00:22:00| 16.5|
+-------------------+---------+
Can someone please help me with this?
Thanks in advance.