These are the values of my dataframe:
+-------+----------+
| ID| Date_Desc|
+-------+----------+
|8951354|2012-12-31|
|8951141|2012-12-31|
|8952745|2012-12-31|
|8952223|2012-12-31|
|8951608|2012-12-31|
|8950793|2012-12-31|
|8950760|2012-12-31|
|8951611|2012-12-31|
|8951802|2012-12-31|
|8950706|2012-12-31|
|8951585|2012-12-31|
|8951230|2012-12-31|
|8955530|2012-12-31|
|8950570|2012-12-31|
|8954231|2012-12-31|
|8950703|2012-12-31|
|8954418|2012-12-31|
|8951685|2012-12-31|
|8950586|2012-12-31|
|8951367|2012-12-31|
+-------+----------+
I tried to compute the median (along with the 25th and 75th percentiles) of the ID column per date in PySpark:
import pyspark.sql.functions as f

df1 = df1.groupby('Date_Desc').agg(
    f.expr('percentile(ID, array(0.25))')[0].alias('%25'),
    f.expr('percentile(ID, array(0.50))')[0].alias('%50'),
    f.expr('percentile(ID, array(0.75))')[0].alias('%75'))
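To make this easier to reproduce, here is the same aggregation as a standalone snippet on a small hand-made DataFrame (the session setup and sample rows are mine, just for illustration); on clean data like this it runs without problems:

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Hand-made sample standing in for my real data
sample = spark.createDataFrame(
    [(8951354, '2012-12-31'), (8951141, '2012-12-31'), (8952745, '2012-12-31')],
    ['ID', 'Date_Desc'])

sample.groupby('Date_Desc').agg(
    f.expr('percentile(ID, array(0.25))')[0].alias('%25'),
    f.expr('percentile(ID, array(0.50))')[0].alias('%50'),
    f.expr('percentile(ID, array(0.75))')[0].alias('%75')).show()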
But when I run it on my actual data I get this error:
Py4JJavaError: An error occurred while calling o198.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 29.0 failed 1 times, most recent failure: Lost task 1.0 in stage 29.0 (TID 427, 5bddc801333f, executor driver): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '11/23/04 9:00' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
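From the message, the failure seems to come from parsing a date string ('11/23/04 9:00') somewhere upstream of Date_Desc rather than from the percentile expression itself. The error suggests two workarounds; a sketch of what they would look like (the 'M/d/yy H:mm' pattern is only my guess at the raw source format, and it assumes Date_Desc still holds the raw string at that point):

# Option 1 (quoted in the error): restore the pre-Spark-3.0 parser behaviour
spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')

# Option 2: re-parse the raw string with an explicit pattern wherever
# Date_Desc is built ('M/d/yy H:mm' is an assumption about the source data)
df1 = df1.withColumn('Date_Desc', f.to_date(f.to_timestamp('Date_Desc', 'M/d/yy H:mm')))

How can I get the percentile aggregation to run without hitting this parsing error?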