2 votes

In the Learning Spark book they write:

For operations that act on a single RDD, such as reduceByKey(), running on a pre-partitioned RDD will cause all the values for each key to be computed locally on a single machine, requiring only the final, locally reduced value to be sent from each worker node back to the master.

However, in this answer the author says that no pre-partitioning is needed, because:

reduceByKey() first aggregates elements of the same key with the provided associative reduce function locally on each executor, and then eventually aggregates across executors.

So why does the book suggest pre-partitioning if reduceByKey() aggregates elements on each executor first anyway, without shuffling the data?

2 Answers

1 vote

The answer above pretty much sums up the reduceByKey and partitionBy methods.

To answer your question: you do not need to apply partitionBy before calling reduceByKey.
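
A minimal sketch of this (a toy word count; the SparkContext setup and the input data are just assumptions for illustration):

    import org.apache.spark.{SparkConf, SparkContext}

    object NoPrePartitioning {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("no-pre-partitioning").setMaster("local[*]"))

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1), ("a", 1)))

        // reduceByKey combines values per key inside each partition first
        // (map-side combine), then shuffles only the partial sums.
        val counts = pairs.reduceByKey(_ + _)

        counts.collect().foreach(println) // e.g. (a,3), (b,2)
        sc.stop()
      }
    }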

0 votes

The book doesn't really suggest pre-partitioning. It only describes the behavior of *ByKey methods when applied to an already partitioned RDD. Since partitioning is itself a shuffle, concluding that you should preemptively partition your data for a single reduceByKey is unjustified.
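
One way to see this (a sketch, assuming a local SparkContext; the data is made up) is to compare the lineages. Each plan contains exactly one shuffle, but in the pre-partitioned plan it is partitionBy that moves every raw record, whereas plain reduceByKey shuffles only the locally combined values:

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object ShuffleComparison {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("shuffle-comparison").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1), ("b", 1)))

        val direct = pairs.reduceByKey(_ + _)
        // partitionBy shuffles the raw records; the subsequent reduceByKey
        // reuses its partitioner and adds no second shuffle.
        val prePartitioned = pairs.partitionBy(new HashPartitioner(8)).reduceByKey(_ + _)

        // Both lineages print a single ShuffledRDD stage.
        println(direct.toDebugString)
        println(prePartitioned.toDebugString)
        sc.stop()
      }
    }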

In fact, if the data contains N values with K unique keys and P partitions, the size of the shuffle in the reduceByKey ∘ partitionBy scenario is always greater than or equal to the size of the shuffle with reduceByKey alone.
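
A rough back-of-envelope illustration (the numbers are invented, and it assumes every key occurs in every partition and that map-side combining is fully effective): with N = 1,000,000 values, K = 100 unique keys and P = 10 partitions,

    shuffle(reduceByKey ∘ partitionBy) ≥ N = 1,000,000 records   (partitionBy moves every raw record)
    shuffle(reduceByKey alone)         ≤ K · P = 1,000 records   (each partition emits at most one combined record per key)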

If you're going to apply multiple shuffling operations, the amortized cost of a single partitionBy followed by a set of *byKey or *Join applications might be lower than the cost of applying the *byKey methods directly. Similarly, if you've already shuffled the data as part of a different operation and you're going to apply a shuffling operation later, you should try to preserve the existing partitioning. None of this, however, implies that you should always prefer partitionBy first.
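
Where pre-partitioning can pay off is a workload that reuses the same keyed RDD several times. A hedged sketch (the events/lookup datasets and the partition count are invented for illustration):

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object AmortizedPartitioning {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("amortized-partitioning").setMaster("local[*]"))

        val events = sc.parallelize(Seq(("user1", 3), ("user2", 5), ("user1", 7)))
        val lookup = sc.parallelize(Seq(("user1", "US"), ("user2", "DE")))

        // Pay for the shuffle of `events` once and keep the result around.
        val partitioner = new HashPartitioner(8)
        val partitioned = events.partitionBy(partitioner).cache()

        // reduceByKey reuses the existing partitioner, so `partitioned`
        // is not shuffled again here.
        val totals = partitioned.reduceByKey(_ + _)

        // A co-partitioned join: `totals` keeps the same partitioner,
        // so only `lookup` needs to be moved.
        val joined = totals.join(lookup.partitionBy(partitioner))

        joined.collect().foreach(println)
        sc.stop()
      }
    }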