0
votes

If I apply a hash partitioner to Spark's aggregatebykey function, i.e. myRDD.aggregateByKey(0, new HashPartitioner(20))(combOp, mergeOp)

Does myRDD get repartitioned first before it's key/value pairs are aggregated using combOp and mergeOp? Or does myRDD go through combOp and mergeOp first and the resulting RDD is repartitioned using the HashPartitioner?

1

1 Answers

3
votes

aggregateByKey applies map side aggregation before eventual shuffle. Since every partition is processed sequentially the only operation that is applied in this phase is initialization (creating zeroValue) and combOp. A goal of mergeOp is to combine aggregation buffers so it is not used before shuffle.

If input RDD is a ShuffledRDD with the same partitioner as requested for aggregateByKey then data is not shuffled at all and data is aggregated locally using mapPartitions.