0 votes

I would like to spin up a lot of tasks when doing my calculation but coalesce into a smaller set of partitions when writing to the table.

A simple example is given below, in which the repartition is NOT honored during execution.

My expectation is that the map operation runs in 100 partitions and the final collect happens in only 10 partitions.

It seems Spark has optimized the execution by ignoring the repartition. It would be helpful if someone could explain how to achieve the expected behavior.

sc.parallelize(range(1,1000)).repartition(100).map(lambda x: x*x).coalesce(10).collect()
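
For reference, here is a small sketch of the same pipeline (assuming the SparkContext sc from the snippet above; the variable names are only for illustration) that prints the partition count each RDD reports:

rdd = sc.parallelize(range(1, 1000)).repartition(100)
print(rdd.getNumPartitions())      # 100
squared = rdd.map(lambda x: x * x)
print(squared.getNumPartitions())  # still 100 -- map preserves partitioning
small = squared.coalesce(10)
print(small.getNumPartitions())    # 10
# Because coalesce(10) does not shuffle, it is fused into the same stage as the
# map, so the map tasks effectively run as 10 tasks.
small.collect()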

Stages output (screenshot)

2 Answers

0 votes

Using repartition instead of coalesce achieves the expected behavior.

sc.parallelize(range(1,1000)).repartition(100).map(lambda x: x*x).cache().repartition(10).collect()

Repartition (screenshot)

This solves my problem, but I would still appreciate an explanation of this behavior.
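
As a side note, the RDD API's coalesce also takes a shuffle flag, and repartition(n) is implemented as coalesce(n, shuffle=True), so the following sketch (same pipeline as above) should behave the same way:

sc.parallelize(range(1, 1000)) \
    .repartition(100) \
    .map(lambda x: x * x) \
    .coalesce(10, shuffle=True) \
    .collect()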

0 votes

"Returns a new Dataset that has exactly numPartitions partitions, when (sic) the fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. "

Source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@coalesce(numPartitions:Int):org.apache.spark.sql.Dataset[T]
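
The quote above is from the Dataset API, but RDD coalesce behaves the same way when more partitions are requested than currently exist. A small sketch (assuming a SparkContext sc) illustrating the quoted behavior:

rdd = sc.parallelize(range(1, 1000), 10)
print(rdd.coalesce(100).getNumPartitions())     # stays at 10: coalesce cannot add partitions without a shuffle
print(rdd.repartition(100).getNumPartitions())  # 100: repartition always shuffles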