I am trying to use PySpark to find the average difference between adjacent values for each key in an RDD of (key, value) tuples.
For example, suppose I have an RDD like so:
vals = sc.parallelize([(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)])
I want to find the average of the absolute differences between adjacent values for each key.
For example, for key 2 the average difference would be
(abs(110-130) + abs(130-120))/2 = 15.
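So, if my arithmetic is right, the expected final result would be [(2, 15.0), (3, 3.0), (4, 10.0)].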
This is my approach so far. I am trying to adapt the standard aggregateByKey averaging pattern to compute this instead, but it doesn't seem to work:
from pyspark import SparkContext

# Accumulator is (running value, count), as in the usual averaging pattern
aTuple = (0, 0)
interval = vals.aggregateByKey(
    aTuple,
    lambda a, b: (abs(a[0] - b), a[1] + 1),   # per-partition step (seqOp)
    lambda a, b: (a[0] + b[0], a[1] + b[1]))  # merge step (combOp)
finalResult = interval.mapValues(lambda v: v[0] / v[1]).collect()
I want to do this using only the RDD API, without Spark SQL or any other additional packages.
What would be the best way to do this?
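One alternative I have considered is grouping all the values per key and computing the adjacent differences in plain Python. A rough sketch of that idea (this assumes the values for each key fit in memory on one executor, and that groupByKey hands them back in their original order, which I am not sure Spark actually guarantees):

grouped = vals.groupByKey().mapValues(list)
# Average the absolute differences of consecutive values; assumes at least
# two values per key, otherwise this divides by zero.
avgDiffs = grouped.mapValues(
    lambda xs: sum(abs(x - y) for x, y in zip(xs, xs[1:])) / (len(xs) - 1))
print(avgDiffs.collect())

But I would prefer something that does not have to materialize the whole list of values per key, if that is possible.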
Please let me know if you have any questions.
Thank you for your time.