0
votes

I am trying to compute the average per key, and I have been given the data below:

data = [
    {"x":10,"y":30},{"x":20,"y":40}
]

So far I have tried:

df=sc.parallelize(data)
df.groupByKey().mapValues(lambda x:sum(x)/len(x)).collect()

I am getting this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 17.0 failed 1 times, most recent failure: Lost task 5.0 in stage 17.0 (TID 141, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Expected output:

 {"x":15,"y":35}

Since we are averaging by key, x has the values 10 and 20, so (10+20)/2 = 15, i.e. x: 15, and y becomes (30+40)/2 = 35, i.e. y: 35.

Please provide the full traceback. – NVS Abhilash
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 20.0 failed 1 times, most recent failure: Lost task 5.0 in stage 20.0 (TID 165, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last): – Padmaja Kattamuri

1 Answer

1
vote

Your groupByKey call fails because it expects an RDD of (key, value) pairs, while parallelize(data) gives you an RDD of dicts. Flatten each dict into (key, value) pairs first, then group and average:

data = [
    {"x":10,"y":30},{"x":20,"y":40}
]

rdd = spark.sparkContext.parallelize(data)

# Flatten each dict into (key, value) pairs, group by key, then average the values per key.
val = rdd.flatMap(lambda line: line.items()).groupByKey().mapValues(lambda x: sum(x) / len(x)).collect()

dict(val)

{'x': 15.0, 'y': 35.0}
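
If each key has many values, a sketch along the same lines with aggregateByKey keeps a running (sum, count) per key instead of materializing the full list of values the way groupByKey does. This reuses the same data list; the variable names here are just illustrative:

pairs = spark.sparkContext.parallelize(data).flatMap(lambda d: d.items())

# Accumulate (sum, count) per key, then divide to get the mean.
sums_counts = pairs.aggregateByKey(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # fold one value into the partition-local (sum, count)
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # merge partial (sum, count) pairs across partitions
)
averages = sums_counts.mapValues(lambda s: s[0] / s[1]).collectAsMap()

averages should again come out as {'x': 15.0, 'y': 35.0} for this input.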