0
votes

I have a PySpark script which reads an unpartitioned single parquet file from S3, does some transformations, and writes it back to another S3 bucket partitioned by date.

I'm using s3a to do the read and write. Reading in the file and performing the transformations works with no problem. However, when I try to write out to S3 using s3a with partitioning, it throws the following error:

WARN s3a.S3AFileSystem: Found file (with /): real file? should not happen: folder1/output org.apache.hadoop.fs.FileAlreadyExistsException: Can't make directory for path 's3a://bucket1/folder1/output' since it is a file.

The part of the code I'm using to write is as follows, where I'm trying to append a new partition for a new date to an existing directory:

output_loc = "s3a://bucket1/folder1/output/"

finalDf.write.partitionBy("date", "advertiser_id") \
.mode("append") \
.parquet(output_loc)

I'm using Hadoop v3.0.0 and Spark 2.4.1.

Has anyone come across this issue when using s3a instead of s3n? BTW, it works fine on an older instance using s3n.

Thanks

1
Did you try just giving the s3 path without s3a or s3n? - Jacob Celestine
When trying just s3 it errors with "filesystem not found". However, this is not the issue; it occurs when writing to S3 using s3a, which is now the recommended method - RonD

1 Answer

1
votes

There's an entry in your bucket for the key s3a://bucket1/folder1/output/ (with the trailing slash) whose size is > 0. S3A treats zero-byte keys ending in "/" as empty-directory markers; one with actual content is ambiguous, which is why S3A logs the "real file? should not happen" warning and then refuses to create a directory over what it considers a file.
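The check S3A is effectively making can be sketched as a small helper over an S3 listing (the listing data below is hypothetical; only the bucket/key names come from the question):

```python
def suspect_dir_objects(objects):
    """Given (key, size) pairs from an S3 listing, return keys that end in '/'
    but have size > 0 -- the kind of object S3A flags with 'real file? should
    not happen' and then refuses to treat as a directory."""
    return [key for key, size in objects if key.endswith("/") and size > 0]

# Hypothetical listing of the prefix from the question:
listing = [
    ("folder1/output/", 42),                                   # problem object
    ("folder1/output/date=2019-01-01/part-0.parquet", 1024),   # normal data file
]
print(suspect_dir_objects(listing))  # → ['folder1/output/']
```

Running something like this over the output of an `aws s3api list-objects-v2` call (or the console listing) will show whether such an object exists under the prefix.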

  1. Look in the S3 bucket from the AWS console, see what is there, and delete it.
  2. Try using the output_loc without a trailing / to see if that helps (unlikely...).
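Step 2 above amounts to normalizing the path before the write; a minimal sketch (the helper name is hypothetical, the path is from the question):

```python
def normalize_s3_path(path: str) -> str:
    """Strip trailing slashes so the write targets the prefix itself rather
    than a key literally ending in '/'."""
    return path.rstrip("/")

output_loc = normalize_s3_path("s3a://bucket1/folder1/output/")
print(output_loc)  # → s3a://bucket1/folder1/output
```

The rest of the write (`finalDf.write.partitionBy(...).mode("append").parquet(output_loc)`) stays unchanged.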

Add a follow-up on the outcome; if the delete doesn't fix things, then a Hadoop JIRA may be worth filing.