I am trying to groupBy and then calculate percentile on pyspark dataframe. I tested the following piece of code according to this stackoverflow post:
from pyspark.sql.types import FloatType
import pyspark.sql.functions as func
import numpy as np
qt_udf = func.udf(lambda x,qt: float(np.percentile(x,qt)), FloatType())
df_out = df_in.groupBy('Id').agg(func.collect_list('value').alias('data'))\
.withColumn('median', qt_udf(func.col('data'),func.lit(0.5)).cast("string"))
df_out.show()
but get the following error:
Traceback (most recent call last): > df_out.show() ....> return lambda *a: f(*a) AttributeError: 'module' object has no attribute 'percentile'
This is because of numpy version (1.4.1), the percentile function was added from version 1.5. It is not possible to update numpy version in the short term.