Spark groupby, sort values, then take first and last

Question

I'm using Apache Spark and have a dataframe that looks like this:

scala> df.printSchema
root
 |-- id: string (nullable = true)
 |-- epoch: long (nullable = true)


scala> df.show(10)
+--------------------+-------------+
|                 id |        epoch|
+--------------------+-------------+
|6825a28d-abe5-4b9...|1533926790847|
|6825a28d-abe5-4b9...|1533926790847|
|6825a28d-abe5-4b9...|1533180241049|
|6825a28d-abe5-4b9...|1533926790847|
|6825a28d-abe5-4b9...|1532977853736|
|6825a28d-abe5-4b9...|1532531733106|
|1eb5f3a4-a68c-4af...|1535383198000|
|1eb5f3a4-a68c-4af...|1535129922000|
|1eb5f3a4-a68c-4af...|1534876240000|
|1eb5f3a4-a68c-4af...|1533840537000|
+--------------------+-------------+
only showing top 10 rows

I want to group by the id field to get all the epoch timestamps together for an id. I then want to sort the epochs by ascending timestamp and then take the first and last epochs.

I used the following query, but the first and last epoch values appear to be taken in the order that they appear in the original dataframe. I want the first and last to be taken from a sorted ascending order.

scala> val df2 = df.groupBy("id").
                 agg(first("epoch").as("first"), last("epoch").as("last"))

scala> df2.show()
+--------------------+-------------+-------------+
|                  id|        first|         last|
+--------------------+-------------+-------------+
|4f433f46-37e8-412...|1535342400000|1531281600000|
|d0cba2f9-cc04-42c...|1535537741000|1530448494000|
|6825a28d-abe5-4b9...|1533926790847|1532531733106|
|e963f265-809c-425...|1534996800000|1534996800000|
|1eb5f3a4-a68c-4af...|1535383198000|1530985221000|
|2e65a033-85ed-4e4...|1535660873000|1530494913413|
|90b94bb0-740c-42c...|1533960000000|1531108800000|
+--------------------+-------------+-------------+

How do I retrieve the first and last from the epoch list sorted by ascending epoch?

I will later use string values, not just the numeric epoch. Will min and max also work for strings? — stackoverflowuser2010

user10531058 · Accepted Answer · 2018-10-19T21:08:08

The first and last functions are meaningless when applied outside a Window context: a grouped aggregation has no defined row order, so the value that is taken is purely arbitrary.

Instead, you should do one of the following:

Use the min / max functions if the logic conforms to basic ordering rules (lexicographic for strings, arrays, and structs; numeric for numbers).
Otherwise, use a strongly typed Dataset with map -> groupByKey -> reduceGroups, or groupByKey -> mapGroups.
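
Both options above can be sketched roughly as follows. This is only an illustration, not code from the original post: the Event and Span case classes, the FirstLastPerId object, its method names, and the sample rows are all made up here, and the example assumes a local SparkSession is acceptable for trying it out.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{max, min}

// Hypothetical record types mirroring the question's schema (id, epoch).
final case class Event(id: String, epoch: Long)
final case class Span(id: String, first: Long, last: Long)

object FirstLastPerId {
  // Option 1: min/max follow the column's natural ordering,
  // so they do not depend on the order in which rows arrive.
  def viaMinMax(df: DataFrame): DataFrame =
    df.groupBy("id").agg(min("epoch").as("first"), max("epoch").as("last"))

  // Combines two partial results that belong to the same id.
  def merge(a: Span, b: Span): Span =
    Span(a.id, math.min(a.first, b.first), math.max(a.last, b.last))

  // Option 2: typed map -> groupByKey -> reduceGroups.
  // The comparison inside merge can be swapped for any custom rule.
  def viaReduceGroups(events: Dataset[Event]): Dataset[Span] = {
    import events.sparkSession.implicits._
    events
      .map(e => Span(e.id, e.epoch, e.epoch)) // each row starts as its own span
      .groupByKey(_.id)
      .reduceGroups(merge _)
      .map(_._2)                              // drop the grouping key, keep the span
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("first-last-per-id")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows, deliberately unsorted.
    val events = Seq(
      Event("a", 30L), Event("a", 10L), Event("a", 20L),
      Event("b", 7L), Event("b", 5L)
    ).toDS()

    viaMinMax(events.toDF()).show()
    viaReduceGroups(events).show()

    spark.stop()
  }
}
```

With the sample rows, both variants should report 10 and 30 for id "a" and 5 and 7 for id "b", regardless of input order. The min / max version is usually the simpler choice; the typed version is mainly worth the extra code when the ordering rule cannot be expressed with the built-in comparisons.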
