
This may be a naive question about join / groupBy-agg. In the RDD days, whenever I wanted to perform:

a. a groupBy-agg, I used reduceByKey (of PairRDDFunctions) with an optional partitioning strategy (either a number of partitions or a Partitioner);
b. a join (of PairRDDFunctions) or one of its variants, I had a way to provide the number of partitions.
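For reference, a minimal sketch of those RDD-era APIs (the data and the partition count are made up):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-partitioning").getOrCreate()
val sc = spark.sparkContext

val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val others = sc.parallelize(Seq(("a", "x"), ("b", "y")))

// reduceByKey with an explicit number of partitions, or with a Partitioner
val summed    = pairs.reduceByKey(_ + _, 100)
val summedAlt = pairs.reduceByKey(new HashPartitioner(100), _ + _)

// join with an explicit number of partitions
val joined = pairs.join(others, 100)
```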

With DataFrames, how do I specify the number of partitions for these operations? I could call repartition() after the fact, but that would add another stage to the job.
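For example (assuming an existing DataFrame df; column names are made up), the workaround looks like this, and the repartition() is itself an extra shuffle:

```scala
import org.apache.spark.sql.functions.sum

// The groupBy-agg shuffles into spark.sql.shuffle.partitions tasks (200 by default)...
val agg = df.groupBy("userId").agg(sum("amount").as("total"))

// ...and repartition() after the fact is a second shuffle, i.e. another stage.
val resized = agg.repartition(400)
```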

One workaround to increase the number of partitions / tasks during a join is to set 'spark.sql.shuffle.partitions' to some desired number at spark-submit time. Is there a way to provide this programmatically for every groupBy-agg / join step?
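One option (a sketch, with made-up table names): spark.sql.shuffle.partitions is a runtime SQL conf, so it can also be changed through the session conf right before each query, not only at spark-submit time. Because DataFrames are lazy, the value that matters is the one in effect when the query actually executes:

```scala
// Assuming existing DataFrames bigDf, smallDf and dimDf.
spark.conf.set("spark.sql.shuffle.partitions", "2000")
bigDf.groupBy("key").count().write.parquet("/tmp/big_agg")    // shuffle runs with 2000 partitions

spark.conf.set("spark.sql.shuffle.partitions", "200")
smallDf.join(dimDf, "key").write.parquet("/tmp/small_join")   // shuffle runs with 200 partitions
```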

The reason for doing this programmatically is that, depending on the size of the DataFrame, I can use a larger or smaller number of tasks to avoid an OutOfMemoryError.
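For instance, one way to derive the value from the data size is the rough sketch below; the 128 MB-per-task target, the helper name, and the input path are my own assumptions:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Estimate shuffle parallelism from the on-disk size of the input (hypothetical helper).
def shufflePartitionsFor(inputPath: String,
                         targetBytesPerTask: Long = 128L * 1024 * 1024): Int = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val totalBytes = fs.getContentSummary(new Path(inputPath)).getLength
  math.max(1, (totalBytes / targetBytesPerTask).toInt)
}

spark.conf.set("spark.sql.shuffle.partitions",
               shufflePartitionsFor("/data/events").toString)
```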


1 Answer


AFAIK you can't specify the number of partitions at each step, but:

  • Spark will try to reuse an existing partitioning if one exists, so if you repartition before doing a groupBy, for instance, it should use whatever number of partitions you specified (assuming you're grouping on the same key, of course). The same goes for a join: if both dataframes are partitioned on the same key (which must be the join key) with the same number of partitions, it won't reshuffle; see the sketch after this list.
  • Otherwise you can indeed tweak spark.sql.shuffle.partitions.
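A minimal sketch of the first bullet (the dataframes, key and partition count are made up):

```scala
import org.apache.spark.sql.functions.col

// Both sides repartitioned on the join key with the same partition count,
// so the join should be able to reuse that layout instead of reshuffling.
val leftPart  = leftDf.repartition(400, col("id"))
val rightPart = rightDf.repartition(400, col("id"))
val joined    = leftPart.join(rightPart, "id")

// Likewise, a groupBy on the same key can reuse the 400-partition layout.
val agg = leftPart.groupBy("id").count()
```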