0
votes

I launch an EMR cluster with the following specs:

  • 1 master node, m4.4xlarge, with 32 GB EBS storage
  • 10 core nodes, m4.4xlarge, with 1024 GB EBS storage
  • Auto-termination after the last job completes

A Spark job is associated with the cluster. It reads data from S3 and saves output data to S3.

After several attempts, the pattern is the same each time: the Spark job finishes in about 1 hour and 15 minutes (the jobs show as completed in the Spark Web UI and the output is in S3, which is good), but the EMR cluster then hangs for 20 to 30 minutes before shutting down. So, overall, it takes about 1 hour and 45 minutes.

Why does the EMR cluster take so long to terminate after the last job completes?
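
For reference, the setup corresponds roughly to the boto3 sketch below; the region, release label, roles, bucket and script path are placeholders, not the actual values.

import boto3

emr = boto3.client("emr", region_name="eu-west-1")  # placeholder region

response = emr.run_job_flow(
    Name="spark-batch",
    ReleaseLabel="emr-5.20.0",          # placeholder release label
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master",       # master EBS config omitted for brevity
                "InstanceRole": "MASTER",
                "InstanceType": "m4.4xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "InstanceRole": "CORE",
                "InstanceType": "m4.4xlarge",
                "InstanceCount": 10,
                "EbsConfiguration": {
                    "EbsBlockDeviceConfigs": [
                        {
                            "VolumeSpecification": {"VolumeType": "gp2", "SizeInGB": 1024},
                            "VolumesPerInstance": 1,
                        }
                    ]
                },
            },
        ],
        # Auto-termination: shut the cluster down once the last step finishes.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-bucket/jobs/my_job.py"],  # placeholder script
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])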

Depending on the configuration of the cluster, it may take up to 5-20 minutes for the cluster to completely terminate and release allocated resources. docs.aws.amazon.com/emr/latest/ManagementGuide/… - vvg
@Rumoku Thank you. The latency I am referring to is between the end of the Spark job and the beginning of the termination process. My understanding is that the link you gave describes the termination process itself, which can take 5-20 minutes. But my issue is that it takes 20-30 minutes for the termination process to even start (in the EMR UI, I have to wait more than 20 minutes before my cluster shows the Terminating status). - Comencau
Are you by chance also using Redis or another external resource? Try to add sys.exit(0) at the end of your code to force termination. - Glennie Helles Sindholt

1 Answer

0
votes

We had a similar issue: we called spark.stop() and System.exit() at the end of the code, the job completed (I was watching it live in a terminal), the web UI shut down, and the _SUCCESS token was written, yet the application just sat there and was only marked as complete in the Hadoop Resource Manager 10-40 minutes later.

It ended up being a network issue, which I fixed by increasing the following:

--conf spark.rpc.message.maxSize=512 (default: 128)
--conf spark.network.timeout=600 (default: 120s)
--conf spark.executor.heartbeatInterval=30s (default: 10s)
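
These are spark-submit flags; if you would rather set them in code, a rough PySpark equivalent is below. Note that settings like spark.rpc.message.maxSize have to be in place before the SparkContext starts, so passing them at submit time is the safer route.

from pyspark.sql import SparkSession

# Same values as the flags above; tune them for your own workload.
spark = (
    SparkSession.builder
    .appName("s3-batch-job")                            # placeholder app name
    .config("spark.rpc.message.maxSize", "512")         # MiB, default 128
    .config("spark.network.timeout", "600s")            # default 120s
    .config("spark.executor.heartbeatInterval", "30s")  # default 10s
    .getOrCreate()
)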

One quick way to check is to grep the executor logs. We saw a bunch of the following warnings, which tipped me off:

yarn logs -applicationId <app_id> | grep WARN
...
WARN Executor: Issue communicating with driver in heartbeater
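
For completeness, stopping the session and exiting the driver explicitly, as mentioned above (and as the sys.exit(0) suggestion in the comments implies), looks roughly like this at the end of a PySpark script:

import sys

# ... job logic: read from S3, transform, write the output back to S3 ...

spark.stop()  # stop the SparkContext so YARN can mark the application as finished
sys.exit(0)   # make sure the driver process actually exits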