I am trying to overwrite a Parquet file in S3 with PySpark. Versioning is enabled for the bucket.
I am using the following code:
Write v1:
df_v1.repartition(1).write.parquet(path='s3a://bucket/file1.parquet')
Update v2:
df_v1 = spark.read.parquet("s3a://bucket/file1.parquet")
df_v2 = df_v1....  # <- transform
df_v2.repartition(1).write.mode("overwrite").parquet('s3a://bucket/file1.parquet')
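For completeness, here is a minimal self-contained version of the above (the bucket path, the example rows, and the transform are placeholders for my real ones):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("overwrite-test").getOrCreate()
path = "s3a://bucket/file1.parquet"

# Write v1
df_v1 = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df_v1.repartition(1).write.parquet(path)

# Read v1 back, apply a placeholder transform, then overwrite
df_v1 = spark.read.parquet(path)
df_v2 = df_v1.withColumn("value", F.upper(F.col("value")))  # stands in for my real transform
df_v2.repartition(1).write.mode("overwrite").parquet(path)

# Read back to check the row count
print(spark.read.parquet(path).count())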
But when I read the data back after writing df_v2, it contains rows from both writes. Furthermore, after writing df_v1 I can see one part-xxx.snappy.parquet file under the prefix, and after writing df_v2 I can see two. It behaves like an append rather than an overwrite.
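I am checking the objects under the prefix with something like this boto3 listing (bucket and prefix names are placeholders for mine):

import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="bucket", Prefix="file1.parquet/")
for obj in resp.get("Contents", []):
    print(obj["Key"])  # one part file after the first write, two after the second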
What am I missing? Thanks.
Spark 2.4.4, Hadoop 2.7.3