I have an RDD in Spark (Python code below):
list1 = [(1,1),(10,100)]
df1 = sc.parallelize(list1)
df1.take(2)
## [(1, 1), (10, 100)]
I want to do a custom sort that compares these tuples based on both entries in the tuple. In Python, the logic of this comparison is something like:
# THRESH is some constant
def compare_tuple(a, b):
    center = a[0] - b[0]
    dev = a[1] + b[1]
    r = float(center) / dev  # float() avoids integer division under Python 2
    if r < THRESH:
        return -1
    elif r == THRESH:
        return 0
    else:
        return 1
And in plain Python (2.x) I would do the custom sort as:
list1.sort(compare_tuple)
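(Or, for Python 3 where the cmp argument is gone, the equivalent local sort using functools.cmp_to_key:)

from functools import cmp_to_key

# Same custom sort on a plain Python list, Python 3 style
list1.sort(key=cmp_to_key(compare_tuple))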
How can I do this in PySpark? As per the RDD docs:
https://spark.apache.org/docs/1.4.1/api/python/pyspark.html#pyspark.RDD
the sortBy method accepts only a key function (keyfunc), not a custom comparator.
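So, as I understand it, with the Python signature sortBy(keyfunc, ascending=True, numPartitions=None) I can only do something like sorting by a single field, not apply my comparison logic:

# sorts by the first element of each tuple only
df1.sortBy(lambda t: t[0]).collect()
## [(1, 1), (10, 100)]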
I see that the Scala interface's sortBy supports this (via an implicit Ordering):
https://spark.apache.org/docs/1.4.1/api/scala/index.html#org.apache.spark.rdd.RDD
But I want this in Python Spark. Workaround-type solutions are also welcome, thanks!
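The only workaround I've come up with so far is an untested sketch: wrap each tuple in a small key class whose rich comparisons delegate to compare_tuple, and pass that wrapper as the keyfunc. (CmpKey is my own hypothetical helper, and I assume it would have to live in a module that is importable on the workers, not just in the shell, so the wrapped keys can be unpickled there.)

class CmpKey(object):
    # Hypothetical wrapper: comparisons delegate to compare_tuple,
    # so sortBy/sortByKey can order the wrapped tuples.
    def __init__(self, t):
        self.t = t
    def __lt__(self, other):
        return compare_tuple(self.t, other.t) < 0
    def __eq__(self, other):
        return compare_tuple(self.t, other.t) == 0

sorted_rdd = df1.sortBy(lambda t: CmpKey(t))
sorted_rdd.take(2)

I'm not sure this is idiomatic (or even safe with the range partitioner), so a cleaner approach would be much appreciated.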