I am trying to partition my DataFrame and write it to a Parquet file. It seems to me that repartition changes the DataFrame in memory but does not affect the Parquet partitioning. What is even stranger is that coalesce works. Let's say I have a DataFrame df:

df.rdd.partitions.size          // 4000
var df_new = df.repartition(20)
df_new.rdd.partitions.size      // 20

However, when I try to write a Parquet file, I get the following:

df_new.write.parquet("test.parquet")
[Stage 0:>                        (0 + 8) / 4000]

which would create 4000 files. However, if I use coalesce instead, I get the following:

var df_new = df.coalesce(20)
df_new.write.parquet("test.parquet")
[Stage 0:>                        (0 + 8) / 20]

So I can get what I want when I need to reduce the number of partitions. The problem is that I cannot increase it: if I have 8 partitions and try to increase them to 100, it always writes only 8 files.
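
For example, here is a minimal sketch of what I see when I try to grow the count with coalesce (assuming a Spark 2.x SparkSession named spark; the numbers are illustrative):

val df8 = spark.range(1000).repartition(8)   // start with 8 partitions
df8.coalesce(100).rdd.partitions.size        // still 8: the count never grows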

Does anybody know how to fix this?


1 Answer


First of all, you should not pass a file path to the parquet() method, but a folder path instead. Spark handles the Parquet file names inside that folder on its own.
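
For illustration (the folder path is hypothetical), a write like the one below leaves one part file per partition in the target folder, plus a _SUCCESS marker:

df_new.write.parquet("/path/to/output")
// /path/to/output/_SUCCESS
// /path/to/output/part-00000-...parquet
// /path/to/output/part-00001-...parquet
// ... one part file per partition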

Then, you must be aware that coalesce only reduces the number of partitions (without a shuffle), while repartition lets you repartition your DataFrame (with a shuffle) into any number of partitions you need, larger or smaller. Check out this SO question for more details on repartition vs. coalesce.
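
To make the difference concrete, here is a small sketch (again assuming a SparkSession named spark):

val df = spark.range(1000).repartition(8)
df.repartition(100).rdd.partitions.size  // 100: repartition can increase the count (shuffle)
df.coalesce(20).rdd.partitions.size      // 20: coalesce can only decrease it (no shuffle)
df.coalesce(100).rdd.partitions.size     // 8: asking coalesce for more partitions is a no-op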

In your case, you want to increase the number of partitions, so you need to use repartition:

df.repartition(20).write.parquet("/path/to/folder")
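
If you want to verify the result, you can count the part files that were written (a sketch; the path is illustrative, and FileSystem/Path are standard Hadoop classes shipped with Spark):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val parts = fs.listStatus(new Path("/path/to/folder"))
  .count(_.getPath.getName.startsWith("part-"))
// parts should be 20: one file per partition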