In our project we use repartition(1) before writing data into a table. I would like to know why coalesce(1) cannot be used here, since repartition is a more costly operation than coalesce. I know repartition distributes data evenly across partitions, but when the output is a single part file, why can't we use coalesce(1)? Please help me understand whether any other factors are involved.
1 Answer
You haven't described anything else about your job's logic, so I can only answer in general terms.
coalesce uses the existing partitions to minimize shuffling. In the case of coalesce(1) versus its counterpart repartition(1) the difference may not be a big deal, but as a guiding principle, repartition creates new partitions and hence does a full shuffle, whereas coalesce minimizes the amount of shuffling. In my spare time I came across this excellent article: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908 Look for the quote: "Coalesce sounds useful in some cases, but has some problems."
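To make the difference concrete, here is a small toy model in plain Python. It is not Spark code and not how Spark implements either operation. The list-of-lists "partitions" and the two helper functions are hypothetical, written only to illustrate the principle that coalesce merges existing partitions as they are, while repartition redistributes every record.

```python
from typing import List

Partition = List[int]


def coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model: merge whole existing partitions into at most n groups."""
    n = min(n, len(partitions))  # this model never increases the partition count
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Each input partition is moved wholesale into one output partition,
        # so records are never split up and any skew is carried over.
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Toy model of a full shuffle: deal every record out round-robin."""
    out: List[Partition] = [[] for _ in range(n)]
    position = 0
    for part in partitions:
        for record in part:
            out[position % n].append(record)
            position += 1
    return out


# A skewed input: one large partition and two tiny ones.
skewed = [[1, 2, 3, 4, 5, 6], [7], [8]]

print([len(p) for p in coalesce(skewed, 2)])     # [7, 1]  skew is preserved
print([len(p) for p in repartition(skewed, 2)])  # [4, 4]  records are rebalanced

# With a single target partition both models produce the same output.
print(coalesce(skewed, 1) == repartition(skewed, 1))  # True
```

In this model the two operations give the same result for a target of 1, which matches your observation that the written output is a single part file either way. What the model does not show is where the work runs. In real Spark the calls are df.coalesce(1) and df.repartition(1), and the Spark API documentation for coalesce notes that a drastic coalesce such as numPartitions = 1 may cause the computation to take place on fewer nodes than you would like (one node in that case), and suggests repartition to avoid this: the extra shuffle lets the upstream stage keep its parallelism. That is likely one of the "problems" the quoted article refers to, and a plausible reason for your project's choice, though I can't confirm it without seeing the job.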