If I want to repartition a dataframe, How to decide on the number of partitions that need to be made? How to decide on whether to use repartition or coalesce? I understand that coalesce is basically used only to reduce the number of partitions. But how can we decide which to use in what scenario?

Does this answer your question? Spark - repartition() vs coalesce() - Robert Kossendey

1 Answer

There is no single parameter that decides this. Several factors determine how many partitions to use and whether to call repartition or coalesce.

  • Based on the size of the data: if a file is very large, you can assign 2 or 3 partitions per block to increase parallelism. But if you create too many partitions, the data gets split into many small files, and in big data systems small files hurt performance. For example, splitting one 128 MB block in two gives 128/2 = 64 MB per partition, so one task processes 64 MB.

  • Based on the cluster size: if you have a large number of idle executors/cores, you can size the partition count accordingly so that all of them do useful work.

  • repartition() triggers a full shuffle of the data, while coalesce() avoids a full shuffle by merging existing partitions, which is why it is used only to reduce the partition count.
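To make the size-based rule of thumb concrete, here is a small sketch of the arithmetic. The function name and the 64 MB target are illustrative assumptions, not Spark APIs:

```python
import math

def num_partitions(file_size_mb, target_partition_mb=64):
    """Partition count so each partition holds roughly target_partition_mb of data.

    Hypothetical helper for sizing; 64 MB is the half-block example above.
    """
    return max(1, math.ceil(file_size_mb / target_partition_mb))

# A 1 GB file at ~64 MB per partition -> 16 partitions.
print(num_partitions(1024))     # 16
# One 128 MB block split in two, as in the example above.
print(num_partitions(128, 64))  # 2
```

You would then pass a number like this to df.repartition(n).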
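The shuffle difference in the last bullet can be illustrated with a toy model (this is plain Python, not Spark's implementation): coalesce merges whole existing partitions locally, so rows never leave their original group, while repartition redistributes every individual row across the new partitions:

```python
def toy_coalesce(partitions, n):
    """Merge whole partitions into n groups; no per-row movement (no shuffle)."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def toy_repartition(partitions, n):
    """Redistribute every row round-robin across n partitions (full shuffle)."""
    flat = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(flat):
        out[i % n].append(row)
    return out

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(toy_coalesce(parts, 2))     # [[1, 2, 5, 6], [3, 4, 7, 8]]
print(toy_repartition(parts, 2))  # [[1, 3, 5, 7], [2, 4, 6, 8]]
```

Note how coalesce keeps each original pair intact inside a group, whereas repartition scatters rows, which is exactly the extra network cost a real Spark shuffle incurs.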