I have a Spark DataFrame df with 20 partitions, where each partition holds one day's worth of data. In other words, my input DataFrame is already partitioned by day. My objective is to write the output as Parquet, also partitioned by day. If I try the command below:
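For context, here is a minimal sketch of the setup (the input path and column name are just illustrative, not my real ones):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input path; the data has a 'day' column and arrives with
# one Spark partition per day.
df = spark.read.parquet("/data/events")

print(df.rdd.getNumPartitions())    # -> 20, one partition per day
df.select("day").distinct().count() # -> 20 distinct days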
df.repartition(5).write.mode("overwrite").partitionBy(['day']).parquet("path")
there is a lot of shuffling, even though my input DataFrame is already partitioned by day. Please note that this DataFrame contains more than 1 billion rows, and the shuffle is killing my executors.
Is there a way I can write each partition as-is into a Parquet file without any shuffle?