This may be a naive question on join / groupBy-agg. In the RDD days, whenever I wanted to perform
a. a groupBy-agg, I used reduceByKey (of PairRDDFunctions) with an optional partitioning strategy (either a number of partitions or a Partitioner)
b. a join (of PairRDDFunctions) or one of its variants, where I likewise had a way to provide the number of partitions
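Something like this minimal sketch (the data and the count 400 are just illustrative):

  import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("rdd-partitioning"))
  val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))

  // a. groupBy-agg via reduceByKey, with an explicit partition count or Partitioner
  val counts  = pairs.reduceByKey(_ + _, 400)
  val counts2 = pairs.reduceByKey(new HashPartitioner(400), _ + _)

  // b. join, again with an explicit partition count
  val joined = pairs.join(other, 400)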
With DataFrames, how do I specify the number of partitions for such an operation? I could use repartition() after the fact, but that adds another stage to the job.
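For example (sketch, assuming a SparkSession named spark; the DataFrame and the 400 are illustrative):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.sum

  val spark = SparkSession.builder().appName("df-partitioning").getOrCreate()
  import spark.implicits._
  val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

  // The groupBy shuffles into spark.sql.shuffle.partitions partitions (default 200);
  // repartition() then forces a second shuffle, i.e. an extra stage, just to change that number
  val agg = df.groupBy("key").agg(sum("value")).repartition(400)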
One workaround to increase the number of partitions/tasks during a join is to set 'spark.sql.shuffle.partitions' to some desired number at spark-submit time. I am trying to see if there is a way to set this programmatically for every groupBy-agg / join step.
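The closest per-step version of that workaround I can see is a sketch like the following (bigDf and dimDf are illustrative names for a large and a small DataFrame):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.sum

  val spark = SparkSession.builder().appName("per-step-conf").getOrCreate()
  import spark.implicits._
  val bigDf = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")  // stand-in for a large input
  val dimDf = Seq(("a", "x"), ("b", "y")).toDF("key", "name")         // stand-in for a small one

  // Because DataFrames are lazy, the conf value in force when the action runs
  // is what the shuffle uses, so each setting needs its own action
  spark.conf.set("spark.sql.shuffle.partitions", "1000")
  bigDf.groupBy("key").agg(sum("value")).count()   // this shuffle uses 1000 partitions

  spark.conf.set("spark.sql.shuffle.partitions", "50")
  bigDf.join(dimDf, "key").count()                 // this shuffle uses 50 partitions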
The reason to do it programmatically is that, depending on the size of the DataFrame, I could use a larger or smaller number of tasks to avoid an OutOfMemoryError.
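Roughly what I have in mind (a hypothetical heuristic of my own, not a Spark API; spark and bigDf as in the sketch above):

  import org.apache.spark.sql.DataFrame

  // Scale the shuffle-partition count with the DataFrame's current number of
  // input partitions, clamped to a sane range, and apply it before the shuffle
  def shufflePartitionsFor(df: DataFrame, factor: Int = 2): Int =
    math.min(math.max(df.rdd.getNumPartitions * factor, 50), 2000)

  spark.conf.set("spark.sql.shuffle.partitions", shufflePartitionsFor(bigDf).toString)
  bigDf.groupBy("key").agg(sum("value")).count()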