I have a PySpark RDD groupBy transformation that needs to shuffle into 10000 partitions to reduce the memory required for the operation on each partition. After that, however, I want to write the output to a smaller number of files, e.g. 100. If I use coalesce on the output, it automatically collapses the shuffle to 100 partitions as well, which is slow and causes memory errors. The other approach I have tried is to persist the RDD after the groupBy (using count() to force materialization) and then coalesce. This achieves what I'm looking for, but it adds the overhead of serializing the RDD to disk. I have also tried localCheckpoint(), which, as I understand it, needs to cache/persist the RDD anyway. So I'm wondering if there is a better approach. In other words, is there a way to force the coalesce to happen only after a certain shuffle operation?
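For reference, here is a minimal sketch of what I mean; the input path, key extraction, and output paths are hypothetical placeholders:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext.getOrCreate()

    # Placeholder input: key/value pairs built from a text file.
    rdd = sc.textFile("hdfs:///input/path") \
            .map(lambda line: (line.split(",")[0], line))

    # groupByKey with 10000 partitions to keep each partition small.
    grouped = rdd.groupByKey(numPartitions=10000)

    # Naive approach: coalesce(100) gets pushed up into the shuffle,
    # so the groupByKey effectively runs with only 100 partitions.
    # grouped.coalesce(100).saveAsTextFile("hdfs:///output/path")

    # Workaround I tried: persist and materialize first, then coalesce,
    # so the shuffle still runs with 10000 partitions.
    grouped.persist(StorageLevel.MEMORY_AND_DISK)
    grouped.count()  # forces materialization at 10000 partitions
    grouped.coalesce(100).saveAsTextFile("hdfs:///output/path")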
1 Answer
If your application has several wide transformations and the data volume is large, I would suggest writing the grouped output to a temporary location as-is, then reading it back and writing it again; the second write only involves a narrow transformation (the coalesce), so it stays fast and cannot affect the earlier shuffle. Alternatively, you can avoid the intermediate write by estimating the shuffle data size per partition and tuning the shuffle parallelism (200 by default) up or down to reduce the run time.
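A minimal sketch of the write-to-temp-and-read-back approach, assuming HDFS paths and a simple key extraction as placeholders:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # Hypothetical paths; adjust for your storage layer.
    tmp_path = "hdfs:///tmp/grouped_output"
    final_path = "hdfs:///output/final"

    rdd = sc.textFile("hdfs:///input/path") \
            .map(lambda line: (line.split(",")[0], line))

    # Run the wide transformation at full parallelism and write it out as-is.
    grouped = rdd.groupByKey(numPartitions=10000).mapValues(list)
    grouped.saveAsTextFile(tmp_path)

    # Reading the intermediate result back starts a new lineage, so the
    # coalesce is a narrow transformation that only affects the final write.
    sc.textFile(tmp_path).coalesce(100).saveAsTextFile(final_path)

The trade-off is an extra round trip to storage, which is essentially the same cost the question is trying to avoid with persist, but it decouples the shuffle parallelism from the output file count completely.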