0
votes

The RDD is key-value pair. groupByKey() could create a lot of shuffle which harms the performance. I was wondering how to reduce unnecessary shuffle using groupByKey()

If I first repartition RDD first, and then groupByKey, will it help?

val inputRdd2 = inputRdd.partitionBy(new HashPartitioner(partitions=500) )

inputRdd2.groupByKey()

Does partitionBy() also create shuffle? Thanks

1

1 Answers

0
votes

If I first repartition RDD first, and then groupByKey, will it help?

It won't. partitionBy itself is a shuffle, and reduceByKey doesn't apply map side reduction anyway, so overall it won't change a thing.

Unfortunately, in general case, there are no good news for you. If you want groupByKey you have to pay the price. While well designed data collection and ingestion process can increase data locality and reduce shuffles in downstream consumers (like Spark), there is not much you can do on arbitrary input.

On the bright side many groupBy applications can be expressed in different ways, especially if exact results are not required. Different types of probabilistic data structures are probably the most prominent example.