6
votes

Suppose I have an RDD of 5-element tuples, e.g., RDD[(Double, String, Int, Double, Double)].

How can I sort this RDD efficiently by the fifth element?

I tried mapping this RDD into key-value pairs and using sortByKey, but sortByKey seems quite slow: it is slower than collecting the RDD and calling sortWith on the collected array. Why is that?

Thank you very much.

3
"it is slower than I collected this RDD and used sortWith on the collected array." Of course it is. If you collect it, everything is on one node and you are then doing an in-memory sort. Spark is for big datasets that don't fit on one node, and it carries considerable overhead compared to single-node computation. If your data set isn't that big, you probably don't want to use Spark. It's not a magic "make things faster" solution. - The Archetypal Paul
Thank you for your explanation. - Carter

3 Answers

10
votes

You can do this with sortBy acting directly on the RDD:

myRdd.sortBy(_._5) // Sort by 5th field of each 5-tuple

sortBy also takes optional parameters for the sort order ("ascending", default true) and the number of partitions ("numPartitions").
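
To make the optional parameters concrete, here is a minimal sketch that sorts a small 5-tuple RDD by its fifth field in both directions. The object name, the Row alias, the sample rows, the app name, and the local[2] master are assumptions chosen for illustration; only sortBy and its ascending and numPartitions parameters come from the answer above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SortByFifth {
  // Hypothetical alias for the 5-tuple shape described in the question.
  type Row = (Double, String, Int, Double, Double)

  // Sort by the 5th field; direction and partition count are passed through to sortBy.
  def sortByFifth(rdd: RDD[Row], ascending: Boolean, numPartitions: Int): RDD[Row] =
    rdd.sortBy(_._5, ascending = ascending, numPartitions = numPartitions)

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for a quick experiment.
    val conf = new SparkConf().setAppName("sort-by-fifth").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val myRdd: RDD[Row] = sc.parallelize(Seq(
        (1.0, "a", 1, 0.5, 3.0),
        (2.0, "b", 2, 0.1, 1.0),
        (3.0, "c", 3, 0.9, 2.0)
      ))
      // Ascending by the 5th field.
      sortByFifth(myRdd, ascending = true, numPartitions = 2).collect().foreach(println)
      // Descending by the 5th field.
      sortByFifth(myRdd, ascending = false, numPartitions = 2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

With data this small, the shuffle behind sortBy will likely cost more than the sort itself, which matches the comment on the question.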

3
votes

If you want to sort in descending order and the sort key is an Int, you can negate the key with "-" to get descending order.

For example:

I have an RDD of (String, Int) tuples. To sort this RDD by its 2nd element in descending order:

rdd.sortBy(x => -x._2).collect().foreach(println);

I have an RDD of (String, String) tuples. Strings cannot be negated, so to sort this RDD by its 2nd element in descending order, pass false for the "ascending" parameter:

rdd.sortBy(x => x._2, false).collect().foreach(println);
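
One caveat with the negation trick, sketched here on a plain Scala Seq rather than an RDD so it runs without a cluster: negating Int.MinValue overflows back to Int.MinValue, so that key ends up first instead of last. The object name and sample pairs are made up for illustration. The key function does the same arithmetic when passed to an RDD's sortBy, so rdd.sortBy(_._2, ascending = false) is likely the safer choice for Int keys as well.

```scala
object DescendingSortDemo {
  // Sample data (assumed); "b" carries the smallest possible Int.
  val pairs: Seq[(String, Int)] = Seq(("a", 3), ("b", Int.MinValue), ("c", 7))

  // Negation trick: -Int.MinValue overflows to Int.MinValue, so "b" sorts first.
  val byNegation: Seq[(String, Int)] = pairs.sortBy(x => -x._2)

  // Reversed ordering: no arithmetic on the key, so no overflow.
  val byReverse: Seq[(String, Int)] = pairs.sortBy(_._2)(Ordering[Int].reverse)

  def main(args: Array[String]): Unit = {
    println(byNegation.map(_._1)) // List(b, c, a): "b" is misplaced
    println(byReverse.map(_._1))  // List(c, a, b): true descending order
  }
}
```
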
1
votes

sortByKey is the only distributed sorting API for Spark 1.0.

How much data are you trying to sort? A small amount will sort faster locally on a single machine. Spark shines when you are sorting many gigabytes of data that may not even fit on a single node.