
I have a dataframe where I am aggregating on a column and picking the last element, but it returns a different result every time. Is there a way to resolve this issue so I get the same, correct result on every run?

import spark.implicits._

val sourceDF = Seq(
  (11, "a1", "a2"),
  (11, "b1", "b2"),
  (22, "c1", "c2"),
  (22, "d1", "d2"),
  (33, "e1", "e2")
).toDF("id", "name", "city")

sourceDF.show(false)

sourceDF.groupBy("id").agg(
  last("name"),
  last("city")
).show(false)

+---+-----------------+-----------------+                                       
|id |last(name, false)|last(city, false)|
+---+-----------------+-----------------+
|33 |e1               |e2               |
|11 |a1               |a2               |
|22 |c1               |c2               |
+---+-----------------+-----------------+

Thanks in advance.

1
From the PySpark `last` function documentation: "The function is non-deterministic because its results depends on order of rows which may be non-deterministic after a shuffle. Use an order by to get deterministic results." - Vamsi Prabhala
Thanks Vamsi, will try. - Babu

1 Answer


With a large dataset this can be an issue; try imposing an order on your grouping column before aggregating:

sourceDF.orderBy("id").groupBy("id").agg(
  last("name"),
  last("city")
).show(false)
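Note that Spark does not guarantee that the sort order survives the shuffle introduced by `groupBy`, and ordering by the grouping column alone does not fix the order of rows *within* each group. A sketch of a fully deterministic alternative uses a window with an explicit ordering column (here assuming rows within an `id` can be ordered by `name`; the column names `last_name`/`last_city` are just illustrative):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Order rows within each id group by an explicit column,
// and span the whole group so last() sees every row.
val w = Window.partitionBy("id")
  .orderBy("name")
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

sourceDF
  .withColumn("last_name", last("name").over(w))
  .withColumn("last_city", last("city").over(w))
  .select("id", "last_name", "last_city")
  .distinct()
  .show(false)
```

Because the ordering inside each window is explicit, the picked row no longer depends on shuffle order, so the result is the same on every run.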