0
votes

I had an AWS Glue job with a PySpark ETL script which wrote a dynamic frame to Redshift as a table and to S3 as JSON. One of the columns in this frame is status_date. I had no issue writing this frame.
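For reference, the write step looks roughly like this (the catalog database, connection name, table names, and paths below are placeholders, not my real ones):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# The DynamicFrame produced earlier in the ETL
# (catalog database and table names are placeholders)
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# Write the frame to Redshift as a table
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.my_table", "database": "dev"},
    redshift_tmp_dir="s3://my-glue-temp-bucket/temp/",
)

# Write the same frame to S3 as JSON
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://bucket_name/folder1/folder2/folder3/folder4/"},
    format="json",
)
```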

I then had a requirement to add two more columns, financial_year and financial_quarter, based on status_date. For this I created a UDF and added the two new columns using withColumn. Using printSchema() and show() I saw that the columns were successfully created and the values in them were correct.
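The new columns are added roughly like this; the UDF bodies are illustrative (my real logic may differ slightly) and assume status_date is a 'yyyy-MM-dd' string and an April-to-March financial year:

```python
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Illustrative UDF: financial year from status_date ('yyyy-MM-dd')
def financial_year(status_date):
    if status_date is None:
        return None
    year, month = int(status_date[:4]), int(status_date[5:7])
    return year if month >= 4 else year - 1  # Apr-Mar financial year

# Illustrative UDF: financial quarter (Q1 = Apr-Jun, ..., Q4 = Jan-Mar)
def financial_quarter(status_date):
    if status_date is None:
        return None
    month = int(status_date[5:7])
    return ((month - 4) % 12) // 3 + 1

fy_udf = udf(financial_year, IntegerType())
fq_udf = udf(financial_quarter, IntegerType())

# Convert to a DataFrame, add the columns, then convert back to a DynamicFrame
df = dyf.toDF() \
        .withColumn("financial_year", fy_udf("status_date")) \
        .withColumn("financial_quarter", fq_udf("status_date"))
dyf = DynamicFrame.fromDF(df, glue_context, "dyf_with_financials")
```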

The problem came when I tried to write this frame to AWS S3 and AWS Redshift. It gives a weird error which I am not able to troubleshoot.

Error if I write to Redshift - An error occurred while calling o177.pyWriteDynamicFrame. File already exists:s3://aws-glue-temporary-***********-ap-south-1/ShashwatS/491bb37a-404a-4ec5-a459-5534d94b0206/part-00002-af529a71-7315-4bd1-ace5-91ab5d9f7f46-c000.csv

Error if I write to s3 as json - An error occurred while calling o175.json. File already exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055/part-00026-fd481713-2ccc-4d23-98b0-e96908cb708c-c000.json

As you can see, both errors are of a similar kind. Below is the error trace. Some help required.

> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 3 in stage 503.0 failed 4 times, most recent failure: Lost task
> 3.3 in stage 503.0 (, executor 15):
> org.apache.hadoop.fs.FileAlreadyExistsException: File already
> exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-00003-055f9327-1368-4ba7-9216-6a32afac1843-c000.json
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
>   
The error message says that the target files already exist. These are probably temporary files which will be used to copy data to Redshift. Have you checked if the files exist in that S3 bucket? If they exist, have you tried to remove them? – Gokhan Atil
Hi Gokhan, a blank folder with a timestamp as the folder name is dynamically created each time the job runs, so there are no similar files there. Also, this error occurs only if I include the lines for the PySpark UDF and add the columns using withColumn. Without these two, I am not facing any issue. – Shashwat Tiwary
Were you able to resolve this issue? I'm facing a similar problem. – Scrotch

2 Answers

0
votes

Are you trying to write to the same location twice in your code? For example, if you have already written to s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055 once in your code and try to do it again in the same job, it will not work.
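If that is what is happening, the sketch below shows the pattern with plain Spark writers (the path is taken from the question and used only as a placeholder); either write each output to a distinct path or set an explicit save mode:

```python
# df is the DataFrame being written
output_path = "s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055"

# The first write succeeds
df.write.json(output_path)

# A second write to the same path in the same job fails because the path
# already exists; overwrite explicitly if that is really the intent
df.write.mode("overwrite").json(output_path)
```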

0
votes

I had this error and it took me a couple of days to find the cause. My issue was caused by the file format/data types of the S3 files. Create UTF-8 files or convert all your data files to UTF-8.
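If you want to try the encoding route, something like this can re-encode a local copy of a data file to UTF-8 before uploading it again (the source encoding "latin-1" and the file names are only examples and must match your actual files):

```python
# Re-encode a local copy of a source data file to UTF-8.
# "latin-1" is just an example; replace it with the file's real encoding.
with open("source_file.json", "r", encoding="latin-1") as src, \
     open("source_file_utf8.json", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)
```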

Or find the "bad" record by ordering the records in the source S3 files by ID or some unique identifier, and then looking at the last output file created, for example s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-xxxx.

Find the last record in this file and match it back to the source file. In my case the problem record was 3 to 7 lines past that last imported record. If it's a special character, then you need to change your file format.

Or a quick check is to just remove that special character and recheck whether the output file gets past the previous "bad" record.
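To locate such a special character, a quick scan of the suspect file can help; this assumes line-delimited records and uses a placeholder file name:

```python
# Print the line numbers of any records that are not valid UTF-8
with open("part-00003.json", "rb") as f:  # placeholder name for the suspect part file
    for line_number, raw_line in enumerate(f, start=1):
        try:
            raw_line.decode("utf-8")
        except UnicodeDecodeError as err:
            print(f"Line {line_number} contains a non-UTF-8 byte: {err}")
```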