1
votes

I have a Spark dataframe df with 20 partitions, and each partition holds one day's worth of data. That is, my input dataframe is already partitioned by day. My objective is to write a Parquet file that is also partitioned by day. If I run the command below:

df.repartition(5).write.mode("overwrite").partitionBy(['day']).parquet("path")

There is a lot of shuffling happening even though my input dataframe is already partitioned. Please note that this dataframe contains more than 1 billion rows, and the shuffle is killing my executors.

Is there a way I can write each partition as-is into a Parquet file without any shuffle?

1
By using coalesce(5) you are reducing the number of partitions. In your case you don't need coalesce; just remove it and try. – hprakash
@hprakash Without coalesce, the partitions for each day end up very small. Anyway, even without the coalesce there is still a lot of shuffling, because Spark does not know that my input data is already partitioned by day. – kitz

1 Answer

0
votes

Is there a way I can write each partition as-is into a Parquet file without any shuffle?

Answer: no. repartition does a full shuffle and creates new partitions. coalesce avoids a full shuffle (it merges existing partitions rather than redistributing every row), but it still has to move data to reach the new partition count, with some heuristics to minimize that movement.
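A minimal sketch of the difference, assuming the asker's df with its 20 input partitions and an active SparkSession:

    # repartition always triggers a full shuffle into the requested number of
    # new partitions; coalesce only merges existing partitions (it cannot
    # increase the count) and moves less data.
    shuffled = df.repartition(5)   # full shuffle into 5 new partitions
    merged = df.coalesce(5)        # merges the existing 20 partitions down to 5

    print(df.rdd.getNumPartitions())        # 20
    print(shuffled.rdd.getNumPartitions())  # 5
    print(merged.rdd.getNumPartitions())    # 5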

Can you reduce the shuffle? Yes. Why is repartition(5) needed at all? The low-hanging fruit here is simply to remove it, since it is what triggers the full shuffle. If more context is given about df, there are additional optimizations that could be applied.
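A hedged sketch of that suggestion, reusing the asker's df and output path: partitionBy on the writer only routes each task's rows into day=... output directories, so it does not by itself force a shuffle of an already day-partitioned dataframe.

    # Write each existing partition as-is, one set of files per day directory.
    df.write.mode("overwrite").partitionBy("day").parquet("path")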