I had an AWS Glue Job with ETL script in pyspark which wrote dynamic frame to redshift as a table and to s3 as json. One of the column in this df is status_date. I had no issue in writing this df.
I then had a requirement to add two more columns financial_year and financial_qarter based on the status_date. For this I created a udf added two new columns. Using printSchema() and show() I saw the columns were successfully created and values in the columns were collect.
The problem came when I tried to write this to aws s3 and aws redshift. It is giving a weird error which I am not able to troubleshoot.
Error if I write to Redshift - An error occurred while calling o177.pyWriteDynamicFrame. File already exists:s3://aws-glue-temporary-***********-ap-south-1/ShashwatS/491bb37a-404a-4ec5-a459-5534d94b0206/part-00002-af529a71-7315-4bd1-ace5-91ab5d9f7f46-c000.csv
Error if I write to s3 as json - An error occurred while calling o175.json. File already exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055/part-00026-fd481713-2ccc-4d23-98b0-e96908cb708c-c000.json
As you can see both the errors are similar kind. Below is the error trace. Some help required.
> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 3 in stage 503.0 failed 4 times, most recent failure: Lost task
> 3.3 in stage 503.0 (, executor 15):
> org.apache.hadoop.fs.FileAlreadyExistsException: File already
> exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-00003-055f9327-1368-4ba7-9216-6a32afac1843-c000.json
> at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
> at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
> at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
>