0
votes

I had an AWS Glue job with a PySpark ETL script which wrote a dynamic frame to Redshift as a table and to S3 as JSON. One of the columns in this frame is status_date. I had no issue writing this frame.
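For reference, the write step looks roughly like this (the catalog database, connection name, table names, and paths below are placeholders, not my real ones):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# The DynamicFrame produced earlier in the ETL
# (catalog database and table names are placeholders)
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="my_table"
)

# Write the frame to Redshift as a table
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.my_table", "database": "dev"},
    redshift_tmp_dir="s3://my-glue-temp-bucket/temp/",
)

# Write the same frame to S3 as JSON
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://bucket_name/folder1/folder2/folder3/folder4/"},
    format="json",
)
```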

I then had a requirement to add two more columns, financial_year and financial_quarter, based on status_date. For this I created a UDF and added the two new columns using withColumn. Using printSchema() and show() I saw that the columns were successfully created and the values in them were correct.
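The new columns are added roughly like this; the UDF bodies are illustrative (my real logic may differ slightly) and assume status_date is a 'yyyy-MM-dd' string and an April-to-March financial year:

```python
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Illustrative UDF: financial year from status_date ('yyyy-MM-dd')
def financial_year(status_date):
    if status_date is None:
        return None
    year, month = int(status_date[:4]), int(status_date[5:7])
    return year if month >= 4 else year - 1  # Apr-Mar financial year

# Illustrative UDF: financial quarter (Q1 = Apr-Jun, ..., Q4 = Jan-Mar)
def financial_quarter(status_date):
    if status_date is None:
        return None
    month = int(status_date[5:7])
    return ((month - 4) % 12) // 3 + 1

fy_udf = udf(financial_year, IntegerType())
fq_udf = udf(financial_quarter, IntegerType())

# Convert to a DataFrame, add the columns, then convert back to a DynamicFrame
df = dyf.toDF() \
        .withColumn("financial_year", fy_udf("status_date")) \
        .withColumn("financial_quarter", fq_udf("status_date"))
dyf = DynamicFrame.fromDF(df, glue_context, "dyf_with_financials")
```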

The problem came when I tried to write this frame to AWS S3 and AWS Redshift. It gives a weird error which I am not able to troubleshoot.

Error if I write to Redshift - An error occurred while calling o177.pyWriteDynamicFrame. File already exists:s3://aws-glue-temporary-***********-ap-south-1/ShashwatS/491bb37a-404a-4ec5-a459-5534d94b0206/part-00002-af529a71-7315-4bd1-ace5-91ab5d9f7f46-c000.csv

Error if I write to s3 as json - An error occurred while calling o175.json. File already exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055/part-00026-fd481713-2ccc-4d23-98b0-e96908cb708c-c000.json

As you can see, both errors are of a similar kind. Below is the error trace. Some help required.

> org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 3 in stage 503.0 failed 4 times, most recent failure: Lost task
> 3.3 in stage 503.0 (, executor 15):
> org.apache.hadoop.fs.FileAlreadyExistsException: File already
> exists:s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-00003-055f9327-1368-4ba7-9216-6a32afac1843-c000.json
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
>   at
> com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
>   
The error message says that the target files already exist. These are probably temporary files which will be used to copy data to Redshift. Have you checked if the files exist in that S3 bucket? If they exist, have you tried to remove them? – Gokhan Atil
Hi Gokhan, a blank folder with a timestamp as the folder name is dynamically created each time the job runs, so there are no similar files there. Also, this error occurs only if I include the lines for the PySpark UDF and add the columns using withColumn. Without these two, I am not facing any issue. – Shashwat Tiwary
Were you able to resolve this issue? I'm facing a similar problem. – Scrotch

2 Answers

0
votes

Are you trying to write to the same location twice in your code? For example, if you have already written to s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055 once in your code and try to do it again in the same job, it will not work.
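If that is what is happening, the sketch below shows the pattern with plain Spark writers (the path is taken from the question and used only as a placeholder); either write each output to a distinct path or set an explicit save mode:

```python
# df is the DataFrame being written
output_path = "s3://bucket_name/folder1/folder2/folder3/folder4/24022020124055"

# The first write succeeds
df.write.json(output_path)

# A second write to the same path in the same job fails because the path
# already exists; overwrite explicitly if that is really the intent
df.write.mode("overwrite").json(output_path)
```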

0
votes

I had this error and it took me a couple of days to find the cause. My issue was caused by the file format/data types of the S3 files. Create UTF-8 files or convert all your data files to UTF-8.
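If you want to try the encoding route, something like this can re-encode a local copy of a data file to UTF-8 before uploading it again (the source encoding "latin-1" and the file names are only examples and must match your actual files):

```python
# Re-encode a local copy of a source data file to UTF-8.
# "latin-1" is just an example; replace it with the file's real encoding.
with open("source_file.json", "r", encoding="latin-1") as src, \
     open("source_file_utf8.json", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)
```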

Or find the "bad" record by ordering the records in the source S3 files by ID or some unique identifier, and then looking at the last output file created, for example s3://bucket_name/folder1/folder2/folder3/folder4/24022020150358/part-xxxx.

Find the last record in this file and match it back to the source file. In my case the problem record was 3 to 7 lines past that last imported record. If it's a special character, then you need to change your file format.

Or a quick check is to just remove that special character and recheck whether the output file gets past the previous "bad" record.
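To locate such a special character, a quick scan of the suspect file can help; this assumes line-delimited records and uses a placeholder file name:

```python
# Print the line numbers of any records that are not valid UTF-8
with open("part-00003.json", "rb") as f:  # placeholder name for the suspect part file
    for line_number, raw_line in enumerate(f, start=1):
        try:
            raw_line.decode("utf-8")
        except UnicodeDecodeError as err:
            print(f"Line {line_number} contains a non-UTF-8 byte: {err}")
```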