I have a PySpark script which reads a single unpartitioned Parquet file from S3, does some transformations, and writes back to another S3 bucket, partitioned by date.
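For context, the read side looks roughly like this (bucket and file names below are placeholders, not the real paths):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-repartition").getOrCreate()

# Read the single unpartitioned Parquet file from the source bucket
inputDf = spark.read.parquet("s3a://bucket0/input/data.parquet")

# ... transformations that produce finalDf ...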
I'm using s3a for both the read and the write. Reading the file in and performing the transformations both work fine. However, when I try to write the partitioned output to S3 using s3a, it throws the following error:
WARN s3a.S3AFileSystem: Found file (with /): real file? should not happen: folder1/output org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory for path 's3a://bucket1/folder1/output' since it is a file.
The part of the code I'm using to write is as follows, where I'm trying to append a new partition for the new date to the existing directory:
output_loc = "s3a://bucket1/folder1/output/"
finalDf.write.partitionBy("date", "advertiser_id") \
    .mode("append") \
    .parquet(output_loc)
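From the error it looks as if s3a thinks 's3a://bucket1/folder1/output' is itself a file rather than a directory. One way to check whether there is a stray zero-byte object sitting at that exact key (a sketch using boto3, with the bucket and key taken from the path above):

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

# Look for an object at the bare key 'folder1/output' (no trailing slash).
# s3a treats such an object as a file, which would explain the error above.
try:
    head = s3.head_object(Bucket="bucket1", Key="folder1/output")
    print("Found object at folder1/output, size:", head["ContentLength"])
except ClientError:
    print("No object at key folder1/output")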
I'm using Hadoop 3.0.0 and Spark 2.4.1.
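In case the configuration matters, the s3a-related settings are essentially just credentials, along the lines of the following (keys elided; the property names are the standard Hadoop s3a ones):

hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.access.key", "<access key>")
hconf.set("fs.s3a.secret.key", "<secret key>")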
Has anyone come across this issue when using s3a instead of s3n? BTW, it works fine on an older instance using s3n.
Thanks