If I want to repartition a DataFrame, how do I decide on the number of partitions to create? And how do I decide whether to use repartition or coalesce? I understand that coalesce is basically used only to reduce the number of partitions, but how do we decide which one to use in which scenario?
1 Answer
There is no single parameter that settles this. Several factors determine how many partitions to use and whether to call repartition or coalesce.
Based on the size of the data: if the file is very large, you can use 2 or 3 partitions per block to increase parallelism and improve performance. But if you create too many partitions, the data gets split into many small files, and in big data workloads small files lower performance. Example: 1 block (128 MB) split into 2 partitions gives 128 / 2 = 64 MB per partition, so one task (mapper) processes 64 MB.
Based on the cluster size: if a large number of executors/cores are free, you can raise the number of partitions to match, so that every free core has work to do.
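To make the two factors above concrete, here is a rough sketch in plain Python. The function name `suggest_num_partitions`, the 64 MB target (taken from the 128 / 2 example) and the 16 MB floor are all assumptions made for illustration, not Spark settings, and the result is only a starting point to tune against your own job.

```python
import math


def suggest_num_partitions(data_size_mb, free_cores,
                           target_partition_mb=64, min_partition_mb=16):
    """Rule-of-thumb partition count (hypothetical helper, not a Spark API).

    target_partition_mb and min_partition_mb are illustrative assumptions.
    """
    # Size-based estimate: roughly one partition per target_partition_mb of data.
    by_size = math.ceil(data_size_mb / target_partition_mb)
    # Upper bound that keeps every partition at or above min_partition_mb,
    # to avoid the "many small files" problem.
    max_useful = max(1, int(data_size_mb // min_partition_mb))
    # Aim for at least one partition per free core, but never go past the bound.
    return max(1, min(max(by_size, free_cores), max_useful))


if __name__ == "__main__":
    # One 128 MB block, a single free core: two partitions of about 64 MB each.
    print(suggest_num_partitions(128, free_cores=1))
    # 1 GB of data, 8 free cores: the size-based estimate wins.
    print(suggest_num_partitions(1024, free_cores=8))
```

With little data and many free cores, the floor stops the count from growing just because cores are idle; with a lot of data, the size-based estimate dominates.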
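repartition triggers a full shuffle of the data, and it can either increase or decrease the number of partitions. coalesce avoids the full shuffle by merging existing partitions, so it is usually the cheaper choice when you only need to reduce the number of partitions, although the resulting partitions may be uneven in size.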
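The choice between the two can be sketched as a small helper. This is only an illustration: `resize_partitions` and its `rebalance` flag are hypothetical names, and the code assumes a PySpark DataFrame that exposes `df.rdd.getNumPartitions()`, `df.coalesce(n)` and `df.repartition(n)`.

```python
def resize_partitions(df, target, rebalance=False):
    """Pick coalesce or repartition for a DataFrame (hypothetical helper).

    Assumes df is a PySpark DataFrame whose df.rdd is available.
    Set rebalance=True when the data is skewed and you want rows
    redistributed evenly, accepting the cost of a full shuffle.
    """
    if target < 1:
        raise ValueError("target must be at least 1")
    current = df.rdd.getNumPartitions()
    if not rebalance:
        if target == current:
            # Nothing to change.
            return df
        if target < current:
            # Only reducing: merge existing partitions, no full shuffle.
            return df.coalesce(target)
    # Increasing the count, or rebalancing skewed data: full shuffle.
    return df.repartition(target)
```

For example, `resize_partitions(df, 10)` on a DataFrame with 200 partitions would call `coalesce(10)`, for instance before writing output to avoid producing many small files, while `resize_partitions(df, 400)` or `resize_partitions(df, 10, rebalance=True)` would call `repartition`.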