In Spark difference between repartition(1) and coalesce(1)

Question

In our project we use repartition(1) to write data into a table. I would like to know why coalesce(1) cannot be used here, since repartition is a costly operation compared to coalesce. I know repartition distributes data evenly across partitions, but when the output is a single part file, why can't we use coalesce(1)? Please help me understand whether any other factors are involved.

thebluephantom · Accepted Answer · 2021-09-12T08:33:16

You haven't described the rest of your logic.

coalesce uses the existing partitions to minimize shuffling. For coalesce(1) versus repartition(1) the difference may not be a big deal, but as a guiding principle: repartition creates new partitions and therefore does a full shuffle, whereas coalesce merges existing partitions and so minimizes the amount of shuffling.
In my spare time I chanced upon this excellent article: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908 Look for the quote: "Coalesce sounds useful in some cases, but has some problems."
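
To make the guiding principle concrete, here is a toy model in plain Python. It is not Spark code and not how Spark implements either operation: `toy_coalesce` and `toy_repartition` are hypothetical helpers that only mimic the idea that coalesce glues whole existing partitions together, while repartition redistributes every row. With a target of 1 both end up holding the same rows in one partition; the difference is in how they get there. As far as I recall, the Spark API docs for coalesce also warn that a drastic coalesce such as numPartitions = 1 can make the upstream computation run on fewer nodes than you would like, and suggest repartition for that case, which may be the reason your project uses it.

```python
from typing import List

Partition = List[int]


def toy_coalesce(partitions: List[Partition], n: int) -> List[Partition]:
    """Merge whole existing partitions into n groups.

    Rows never leave their original partition-mates, so no row-level
    shuffle is needed, but the resulting sizes can be uneven.
    """
    merged: List[Partition] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring input partitions map to the same output partition.
        merged[i * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions: List[Partition], n: int) -> List[Partition]:
    """Send every row, round-robin, to one of n brand-new partitions.

    This models a full shuffle: each row may move, sizes come out even.
    """
    out: List[Partition] = [[] for _ in range(n)]
    i = 0
    for part in partitions:
        for row in part:
            out[i % n].append(row)
            i += 1
    return out


# Skewed input: one large partition and three tiny ones (made-up data).
parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]

print([len(p) for p in toy_coalesce(parts, 2)])     # [7, 2]
print([len(p) for p in toy_repartition(parts, 2)])  # [5, 4]

# With a target of 1, both produce one partition with all nine rows.
print(len(toy_coalesce(parts, 1)[0]), len(toy_repartition(parts, 1)[0]))  # 9 9
```

In real Spark the corresponding calls are `df.coalesce(1)` and `df.repartition(1)`; comparing their physical plans with `explain()` on your own job is the reliable way to see what each one does to your stages.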
