A Glue job configured with a maximum capacity of 10 nodes, 1 concurrent run, and no retries on failure is failing with the error "Failed to delete key: target_folder/_temporary". According to the stack trace, the S3 service starts throttling the Glue requests because of the request volume: "AmazonS3Exception: Please reduce your request rate."

Note: The issue is not with IAM, as the IAM role the Glue job uses has permission to delete objects in S3.

I found a suggestion for this issue on GitHub that proposes reducing the worker count: https://github.com/aws-samples/aws-glue-samples/issues/20

"I've had success reducing the number of workers."

However, I don't think 10 workers is too many; if anything, I would like to increase the worker count to 20 to speed up the ETL.

Has anyone who faced this issue had success solving it? How would I go about it?

Shortened stacktrace:

py4j.protocol.Py4JJavaError: An error occurred while calling o151.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: target_folder/_temporary
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:665)
    at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:332)
    ...
Caused by: java.io.IOException: 1 exceptions thrown from 12 batch deletes
    at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:384)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1372)
    at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:663)
    ...
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: ...

Part of the Glue ETL Python script (just in case):

datasource0 = glueContext.create_dynamic_frame.from_catalog(database="database", table_name="table_name", transformation_ctx="datasource0")

# ... relationalizing, renaming, etc. Converting from DynamicFrame to PySpark DataFrame and back.

partition_ready = Map.apply(frame=processed_dataframe, f=map_date_partition, transformation_ctx="map_date_partition")
datasink = glueContext.write_dynamic_frame.from_options(
    frame=partition_ready,
    connection_type="s3",
    connection_options={"path": "s3://bucket/target_folder", "partitionKeys": ["year", "month", "day", "hour"]},
    format="parquet",
    transformation_ctx="datasink")
job.commit()

Solved (kind of), thanks to user ayazabbas

I accepted the answer that pointed me in the right direction. One of the things I was looking for was how to merge many small files into bigger chunks, and repartitioning does exactly that. Instead of repartition(x) I used coalesce(x), where x is 4 * the worker count of the Glue job, so that the Glue service can assign one data chunk to each available vCPU. It might make sense to set x to at least 2 * 4 * worker_count to account for slower and faster parts of the transformation, if they exist.
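Roughly, the step looks like this in the script above (a sketch reusing my variable names, with worker_count hard-coded as described below; the multiplier follows the reasoning above rather than any official Glue recommendation):

from awsglue.dynamicframe import DynamicFrame

worker_count = 10                      # hard-coded to match the job configuration
target_partitions = 4 * worker_count   # roughly one chunk per available vCPU

# Convert to a Spark DataFrame, merge the many small partitions into larger
# chunks with coalesce, then convert back to a DynamicFrame before writing.
coalesced_df = partition_ready.toDF().coalesce(target_partitions)
partition_ready = DynamicFrame.fromDF(coalesced_df, glueContext, "partition_ready")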

Another thing I did was reduce the number of columns by which I partition the data before writing it to S3, from 5 to 4.

The current drawback is that I haven't figured out how to find, from within the Glue script, the worker count that the Glue service actually allocates to the job, so the number is hard-coded to match the job configuration (the Glue service sometimes allocates more nodes than configured).
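One approach I haven't verified end to end would be to ask the Glue API for the current run's capacity via boto3. This sketch assumes Glue passes JOB_NAME and JOB_RUN_ID to the script and that the job's role is allowed to call glue:GetJobRun:

import sys
import boto3
from awsglue.utils import getResolvedOptions

# JOB_NAME and JOB_RUN_ID are arguments the Glue service passes to a job run.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "JOB_RUN_ID"])

job_run = boto3.client("glue").get_job_run(
    JobName=args["JOB_NAME"], RunId=args["JOB_RUN_ID"])["JobRun"]

# Depending on the worker type, either NumberOfWorkers or MaxCapacity (DPUs)
# is populated in the response; use whichever is available.
worker_count = job_run.get("NumberOfWorkers") or int(job_run.get("MaxCapacity", 0))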


1 Answer


I had this same issue. I worked around it by running repartition(x) on the dynamic frame before writing to S3. This forces x files per partition, and the maximum parallelism during the write process is x, which reduces the S3 request rate.

I set x to 1 because I wanted one Parquet file per partition, so I'm not sure what the safe upper limit of parallelism is before the request rate gets too high.
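Against the script in the question, the change is roughly this (a sketch reusing the question's variable names; repartition is called on the DynamicFrame right before the write):

# Collapse the frame to a single Spark partition so only one task writes at a
# time, producing one Parquet file per partition key combination.
partition_ready = partition_ready.repartition(1)

datasink = glueContext.write_dynamic_frame.from_options(
    frame=partition_ready,
    connection_type="s3",
    connection_options={"path": "s3://bucket/target_folder", "partitionKeys": ["year", "month", "day", "hour"]},
    format="parquet",
    transformation_ctx="datasink")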

I couldn't figure out a nicer way to solve this issue; it's annoying because you have so much idle capacity during the write process.

Hope that helps.