How to sort an RDD of tuples with 5 elements in Spark Scala?

Question

I have an RDD of tuples with 5 elements, e.g., RDD[(Double, String, Int, Double, Double)].

How can I sort this RDD efficiently by the fifth element?

I tried mapping this RDD into key-value pairs and using sortByKey, but sortByKey seems quite slow: it is slower than collecting the RDD and using sortWith on the collected array. Why is that?

Thank you very much.

"it is slower than I collected this RDD and used sortWith on the collected array." Of course it is. If you collect it, everything is on one node and you are then doing an in-memory sort. Spark is for big datasets that don't fit on one node, and it carries a (considerable) overhead compared to single-node computation. If your data set isn't that big, you probably don't want to use Spark. It's not a magic "make things faster" solution. — The Archetypal Paul

Shadowlands · Accepted Answer · 2015-10-13T07:24:47

You can do this with sortBy acting directly on the RDD:

myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple

sortBy also takes optional parameters for the sort order (ascending, which defaults to true) and the number of partitions of the result (numPartitions).
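As a rough sketch of how those optional parameters might be used, the example below sorts a small RDD in descending order of the fifth field. The object name SortByFifthExample, the helper sortByFifthDesc, the sample rows, the local[2] master, and the choice of 2 output partitions are all assumptions made for illustration; only sortBy with its ascending and numPartitions parameters comes from the answer above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical example object; names and sample data are illustrative only.
object SortByFifthExample {
  type Row = (Double, String, Int, Double, Double)

  // Sort by the 5th field, largest first, into 2 output partitions.
  // Both ascending and numPartitions are optional arguments of RDD.sortBy.
  def sortByFifthDesc(rdd: RDD[Row]): RDD[Row] =
    rdd.sortBy(_._5, ascending = false, numPartitions = 2)

  def main(args: Array[String]): Unit = {
    // A local master is assumed here so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("sort-by-fifth").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val myRdd: RDD[Row] = sc.parallelize(Seq(
        (1.0, "a", 1, 0.5, 3.0),
        (2.0, "b", 2, 0.1, 1.0),
        (3.0, "c", 3, 0.9, 2.0)
      ))
      // collect() is only reasonable here because the sample is tiny.
      sortByFifthDesc(myRdd).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Passing the key function directly to sortBy avoids the explicit map to key-value pairs that sortByKey requires, though it still performs a distributed shuffle, so the overhead noted in the comment above applies to small data sets either way.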
