I am trying to overwrite a Parquet file in S3 with PySpark. Versioning is enabled for the bucket.
I am using the following code:
Write v1:
df_v1.repartition(1).write.parquet(path='s3a://bucket/file1.parquet')
Update v2:
df_v1 = spark.read.parquet("s3a://bucket/file1.parquet")
df_v2 = df_v1....  # <- transform
df_v2.repartition(1).write.mode("overwrite").parquet('s3a://bucket/file1.parquet')
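For completeness, here is a minimal self-contained version of the above (the bucket path, the example rows, and the transform are placeholders for my real ones):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("overwrite-test").getOrCreate()
path = "s3a://bucket/file1.parquet"

# Write v1
df_v1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df_v1.repartition(1).write.parquet(path)

# Read v1 back, apply a placeholder transform, then overwrite
df_v1 = spark.read.parquet(path)
df_v2 = df_v1.withColumn("value", F.upper(F.col("value")))  # stands in for my real transform
df_v2.repartition(1).write.mode("overwrite").parquet(path)

# Read back to check the row count
print(spark.read.parquet(path).count())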
But when I read the data back after writing df_v2, it contains rows from both writes. Furthermore, after writing df_v1 I can see one part-xxx.snappy.parquet file under the prefix, and after writing df_v2 I can see two. It behaves like an append rather than an overwrite.
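I am checking the objects under the prefix with something like this boto3 listing (bucket and prefix names are placeholders for mine):

import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="bucket", Prefix="file1.parquet/")
for obj in resp.get("Contents", []):
    print(obj["Key"])  # one part file after the first write, two after the second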
What am I missing? Thanks.
Spark 2.4.4, Hadoop 2.7.3