I have a key-value pair RDD. groupByKey() can cause a lot of shuffling, which hurts performance. I was wondering how to reduce unnecessary shuffling when using groupByKey().
If I repartition the RDD first and then call groupByKey(), will that help?
import org.apache.spark.HashPartitioner
val inputRdd2 = inputRdd.partitionBy(new HashPartitioner(partitions = 500))
inputRdd2.groupByKey()
Does partitionBy() also cause a shuffle? Thanks.