Is it possible to do repartition after using partitionBy in a spark DF?

Question

I am asking this question because if I specify repartition as 5, then all my data (>200 GB) is moved to 5 different executors and 98% of the resources are unused. The partitionBy then happens afterwards, which again creates a lot of shuffle. Is there a way to have the partitionBy happen first and then run the repartition on the data?

thebluephantom · Accepted Answer · 2019-02-08T13:53:32

Although the question is not entirely easy to follow, the following aligns with the other answer, and this approach should avoid the unnecessary shuffling mentioned:

// The $"..." column syntax requires: import spark.implicits._
// n = number of partitions, calculated from cluster config (executors) and volume of data to process
val n: Int = ???

df.repartition(n, $"field_1", $"field_2", ...)
  .sortWithinPartitions("field_x", "field_y")
  .write.partitionBy("field_1", "field_2", ...)
  .format("parquet")  // assumed format; use whichever one the job needs
  .save("location")   // output path

where [field_1, field_2, ...] is the same set of fields in both repartition and partitionBy.
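
To make that concrete, here is a minimal sketch of how the pieces could fit together. It is not a definitive implementation: the `estimatePartitions` heuristic, the 128 MB target size, the core count, the Parquet format, the overwrite mode, the output path and the sample rows are all assumptions for illustration, and only two partition columns are used in place of the elided field list.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PartitionedWrite {

  // Hypothetical heuristic (not from the answer): aim for roughly
  // `targetBytes` of input per partition, but never use fewer
  // partitions than the cluster has cores.
  def estimatePartitions(inputBytes: Long, targetBytes: Long, totalCores: Int): Int = {
    val bySize = math.ceil(inputBytes.toDouble / targetBytes).toInt
    math.max(bySize, totalCores)
  }

  // Repartition on the same columns that partitionBy uses, so all rows
  // for one (field_1, field_2) combination end up in a single task
  // before the write starts.
  def writePartitioned(df: DataFrame, n: Int, outputPath: String): Unit = {
    df.repartition(n, col("field_1"), col("field_2"))
      .sortWithinPartitions("field_x", "field_y")
      .write
      .partitionBy("field_1", "field_2")
      .format("parquet")   // assumed output format
      .mode("overwrite")   // assumed save mode
      .save(outputPath)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-write-sketch")
      .master("local[*]")  // local mode, for trying the sketch out
      .getOrCreate()
    import spark.implicits._

    // Tiny made-up data set standing in for the real 200 GB input.
    val df = Seq(
      ("a", 1, 10, 100),
      ("a", 2, 20, 200),
      ("b", 1, 30, 300)
    ).toDF("field_1", "field_2", "field_x", "field_y")

    // Assumed figures: 200 GiB of input, 128 MiB target size, 40 cores.
    val n = estimatePartitions(200L * 1024 * 1024 * 1024, 128L * 1024 * 1024, 40)

    writePartitioned(df, n, "/tmp/partitioned_write_sketch")
    spark.stop()
  }
}
```

Because the rows are hash-partitioned on the key columns, the number of non-empty partitions can never exceed the number of distinct (field_1, field_2) combinations, so a very large n mostly produces empty tasks when the keys have low cardinality.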
