I am trying to partition my DataFrame and write it to a Parquet file. It seems that repartitioning works on the DataFrame in memory but does not affect the Parquet partitioning. What is even stranger is that coalesce works. Let's say I have a DataFrame df:
df.rdd.partitions.size
4000
var df_new = df.repartition(20)
df_new.rdd.partitions.size
20
However, when I try to write a Parquet file, I get the following:
df_new.write.parquet("test.parquet")
[Stage 0:> (0 + 8) / 4000]
which creates 4000 files. However, if I use coalesce instead, I get the following:
var df_new = df.coalesce(20)
df_new.write.parquet("test.parquet")
[Stage 0:> (0 + 8) / 20]
So I can get what I want when reducing the number of partitions. The problem is that I cannot increase it. For example, if I have 8 partitions and try to increase them to 100, it still writes only 8 files.
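For reference, here is a minimal, self-contained sketch of what I am observing (a local SparkSession and a synthetic range DataFrame stand in for my real data; the names are mine, not from my actual job). It shows that repartition changes the partition count in both directions, while coalesce only reduces it and silently ignores a request for more partitions:

```scala
import org.apache.spark.sql.SparkSession

object PartitionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("partition-demo")
      .getOrCreate()

    // Synthetic DataFrame with 8 partitions, standing in for my real df
    val df = spark.range(0, 1000).toDF("id").repartition(8)
    assert(df.rdd.getNumPartitions == 8)

    // repartition performs a full shuffle and can increase or decrease
    assert(df.repartition(100).rdd.getNumPartitions == 100)
    assert(df.repartition(4).rdd.getNumPartitions == 4)

    // coalesce avoids a shuffle and can only decrease;
    // asking for more partitions than exist is a no-op
    assert(df.coalesce(4).rdd.getNumPartitions == 4)
    assert(df.coalesce(100).rdd.getNumPartitions == 8)

    spark.stop()
  }
}
```

The last assertion is exactly my problem: after coalesce(100) the DataFrame still has 8 partitions, so the write produces only 8 files.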
Does anybody know how to fix this?