I am trying to compute an average by key, and I have been given the data below:
data = [
{"x":10,"y":30},{"x":20,"y":40}
]
So far I have tried:
df=sc.parallelize(data)
df.groupByKey().mapValues(lambda x:sum(x)/len(x)).collect()
I am getting this error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 17.0 failed 1 times, most recent failure: Lost task 5.0 in stage 17.0 (TID 141, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
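The traceback above is truncated, so this is an inference rather than a certainty, but a likely cause: sc.parallelize(data) produces an RDD of dicts, while groupByKey expects each element to be a (key, value) 2-tuple, so the Python worker fails when it tries to treat a dict as a pair. A quick check, assuming the same sc and data as above:

rdd = sc.parallelize(data)
print(rdd.first())  # {'x': 10, 'y': 30} -- a dict, not a (key, value) tuple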
Expected output:
{"x":15,"y":35}
since we are averaging by key: x has values 10 and 20, so (10 + 20) / 2 = 15, i.e. x: 15, and y has values 30 and 40, so (30 + 40) / 2 = 35, i.e. y: 35.
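A minimal sketch of one way to produce that result, assuming an existing SparkContext named sc (e.g. from the pyspark shell): flatten each dict into its (key, value) items first, so that groupByKey has real pairs to group.

data = [
    {"x": 10, "y": 30}, {"x": 20, "y": 40}
]

rdd = sc.parallelize(data)

# Flatten each dict into (key, value) tuples:
# {"x": 10, "y": 30} -> ("x", 10), ("y", 30)
pairs = rdd.flatMap(lambda d: d.items())

# Group the values per key, then average them
averages = pairs.groupByKey().mapValues(lambda vals: sum(vals) / len(vals))

print(dict(averages.collect()))  # {'x': 15.0, 'y': 35.0} on Python 3

On larger datasets, aggregateByKey with a running (sum, count) accumulator avoids materializing every value per key the way groupByKey does, but for a toy input like this groupByKey reads more simply.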