0 votes

I would like to spin up a lot of tasks when doing my calculation but coalesce into a smaller set of partitions when writing to the table.

A simple example is given below, in which the repartition is NOT honored during execution.

My expectation is that the map operation runs in 100 partitions and the final collect happens in only 10 partitions.

It seems Spark has optimized the execution by ignoring the repartition. It would be helpful if someone could explain how to achieve the expected behavior.

sc.parallelize(range(1,1000)).repartition(100).map(lambda x: x*x).coalesce(10).collect()
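
For reference, here is a small sketch of the same pipeline (assuming the SparkContext sc from the snippet above; the variable names are only for illustration) that prints the partition count each RDD reports:

rdd = sc.parallelize(range(1, 1000)).repartition(100)
print(rdd.getNumPartitions())      # 100
squared = rdd.map(lambda x: x * x)
print(squared.getNumPartitions())  # still 100 -- map preserves partitioning
small = squared.coalesce(10)
print(small.getNumPartitions())    # 10
# Because coalesce(10) does not shuffle, it is fused into the same stage as the
# map, so the map tasks effectively run as 10 tasks.
small.collect()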

Stages output (screenshot)

2 Answers

0 votes

Using repartition instead of coalesce achieves the expected behavior.

sc.parallelize(range(1,1000)).repartition(100).map(lambda x: x*x).cache().repartition(10).collect()

Repartition (screenshot)

This solves my problem, but I would still appreciate an explanation of this behavior.
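
As a side note, the RDD API's coalesce also takes a shuffle flag, and repartition(n) is implemented as coalesce(n, shuffle=True), so the following sketch (same pipeline as above) should behave the same way:

sc.parallelize(range(1, 1000)) \
    .repartition(100) \
    .map(lambda x: x * x) \
    .coalesce(10, shuffle=True) \
    .collect()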

0 votes

"Returns a new Dataset that has exactly numPartitions partitions, when (sic) the fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. "

Source: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@coalesce(numPartitions:Int):org.apache.spark.sql.Dataset[T]
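
The quote above is from the Dataset API, but RDD coalesce behaves the same way when more partitions are requested than currently exist. A small sketch (assuming a SparkContext sc) illustrating the quoted behavior:

rdd = sc.parallelize(range(1, 1000), 10)
print(rdd.coalesce(100).getNumPartitions())     # stays at 10: coalesce cannot add partitions without a shuffle
print(rdd.repartition(100).getNumPartitions())  # 100: repartition always shuffles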