If I want to repartition a dataframe, How to decide on the number of partitions that need to be made? How to decide on whether to use repartition or coalesce? I understand that coalesce is basically used only to reduce the number of partitions. But how can we decide which to use in what scenario?

Does this answer your question? Spark - repartition() vs coalesce() - Robert Kossendey

1 Answer

There is no single parameter that decides this. Several factors determine how many partitions to use and whether to call repartition or coalesce.

  • Based on the size of the data: if a file is very large, you can assign 2 or 3 partitions per block to increase parallelism. But if you create too many partitions, the data gets split into many small files, and in big data systems small files hurt performance. For example, splitting one 128 MB block in two gives 128/2 = 64 MB per partition, so one task processes 64 MB.

  • Based on the cluster size: if you have a large number of idle executors/cores, you can size the partition count accordingly so that all of them do useful work.

  • repartition() triggers a full shuffle of the data, while coalesce() avoids a full shuffle by merging existing partitions, which is why it is used only to reduce the partition count.
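To make the size-based rule of thumb concrete, here is a small sketch of the arithmetic. The function name and the 64 MB target are illustrative assumptions, not Spark APIs:

```python
import math

def num_partitions(file_size_mb, target_partition_mb=64):
    """Partition count so each partition holds roughly target_partition_mb of data.

    Hypothetical helper for sizing; 64 MB is the half-block example above.
    """
    return max(1, math.ceil(file_size_mb / target_partition_mb))

# A 1 GB file at ~64 MB per partition -> 16 partitions.
print(num_partitions(1024))     # 16
# One 128 MB block split in two, as in the example above.
print(num_partitions(128, 64))  # 2
```

You would then pass a number like this to df.repartition(n).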
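The shuffle difference in the last bullet can be illustrated with a toy model (this is plain Python, not Spark's implementation): coalesce merges whole existing partitions locally, so rows never leave their original group, while repartition redistributes every individual row across the new partitions:

```python
def toy_coalesce(partitions, n):
    """Merge whole partitions into n groups; no per-row movement (no shuffle)."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def toy_repartition(partitions, n):
    """Redistribute every row round-robin across n partitions (full shuffle)."""
    flat = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(flat):
        out[i % n].append(row)
    return out

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(toy_coalesce(parts, 2))     # [[1, 2, 5, 6], [3, 4, 7, 8]]
print(toy_repartition(parts, 2))  # [[1, 3, 5, 7], [2, 4, 6, 8]]
```

Note how coalesce keeps each original pair intact inside a group, whereas repartition scatters rows, which is exactly the extra network cost a real Spark shuffle incurs.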