13
votes


I just finished setting up a small Hadoop cluster (using 3 Ubuntu machines and Apache Hadoop 2.2.0) and am now trying to run Python streaming jobs.
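
For reference, the test job is launched with something along these lines (the script names, paths, and the streaming jar location are placeholders rather than my exact command):

# Placeholder invocation: mapper.py / reducer.py and the input/output
# paths are illustrative, not the exact job that was run.
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /user/test/input \
    -output /user/test/output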

Running a test job I encounter the following problem:
Almost all map tasks are marked as successful, but with a note saying "Container killed".

On the web interface the log for the map tasks says:
Progress 100.00
State SUCCEEDED

but under Note it says, for almost every attempt (~200):
Container killed by the ApplicationMaster.
or
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143

In the log file associated with the attempt I can see a line saying Task 'attempt_xxxxxxxxx_0' done.

I also get 3 attempts with the same log, but those 3 have
State KILLED
and are listed under killed jobs.

stderr output is empty for all jobs/attempts.

When looking at the ApplicationMaster log and following one of the successful (but killed) attempts, I find the following log entries:

  • Transitioned from NEW to UNASSIGNED
  • Transitioned from UNASSIGNED to ASSIGNED
  • several progress updates, including: 1.0
  • Done acknowledgement
  • RUNNING to SUCCESS_CONTAINER_CLEANUP
  • CONTAINER_REMOTE_CLEANUP
  • KILLING attempt_xxxx
  • Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
  • Task Transitioned from RUNNING to SUCCEEDED

All the attempts are numbered xxxx_0, so I assume they are not killed as a result of speculative execution.

Should I be worried about this? And what causes the containers to be killed? Any suggestions would be greatly appreciated!

This still seems to happen every now and then. The output seems fine, but I'm still wondering what is behind this! – GebitsGerbils
Question - I would ask this in the comments but I don't have the rep for that: how much memory are these Python scripts using? If they use too much, don't they get automatically killed? If I am correct, fixing the mapred.child.ulimit setting to unlimited or optimizing your Python script may help (see the sketch after these comments). -Jimmy – jimf
Were you able to solve this? I have a similar problem. – Noah Watkins
No, this still happens every now and then... – GebitsGerbils
Any solutions? I have a similar problem with Hadoop 2.6 on Mac OS X 10.8.3. I used Java code in my MapReduce program. – mary
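
If memory were the culprit, as the comment about memory above suggests, the per-task limits can be raised when submitting the job. A minimal sketch, assuming the standard Hadoop 2.x property names; the values, script name, and paths are only examples, not recommendations:

# Illustrative only: raise the container and JVM heap limits for the
# map tasks of a streaming job. 2048 MB / -Xmx1638m are example values.
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -D mapreduce.map.memory.mb=2048 \
    -D mapreduce.map.java.opts=-Xmx1638m \
    -files mapper.py \
    -mapper "python mapper.py" \
    -input /user/test/input \
    -output /user/test/output_highmem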

3 Answers

2
votes

Yes, I agree with @joshua. It seems to be a bug where a task/container does not die gracefully after successfully finishing its map/reduce work, so after the grace period the ApplicationMaster has to kill it instead.

I am running 'yarn version' = Hadoop 2.5.0-cdh5.3.0

I picked one of the tasks and grep'ed for its history in the log generated for my MR application:

$ yarn logs -applicationId application_1422894000163_0003 | grep attempt_1422894000163_0003_r_000008_0

You will see that "attempt_1422894000163_0003_r_000008_0" goes through "TaskAttempt Transitioned from NEW to UNASSIGNED ... to RUNNING to SUCCESS_CONTAINER_CLEANUP".

In the SUCCESS_CONTAINER_CLEANUP step, you will see messages about this container being killed. After the container is killed, the attempt moves to the "TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED" step.

0
votes

As far as I know, the same task can be run on several nodes (speculative execution): as soon as one node returns the result, the task attempts on the other nodes are killed. That's why the job SUCCEEDED but individual tasks are in the KILLED state.
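
If that were the cause here, a quick way to check is to disable speculative execution for one run and see whether the KILLED attempts disappear. A rough sketch, using the standard Hadoop 2.x property names; the script and paths are placeholders:

# Illustrative only: run the same streaming job once with speculative
# execution disabled and check whether any attempts end up KILLED.
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -D mapreduce.map.speculative=false \
    -D mapreduce.reduce.speculative=false \
    -files mapper.py \
    -mapper "python mapper.py" \
    -input /user/test/input \
    -output /user/test/output_nospec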

0
votes

What version are you using? You may have encountered YARN-903: DistributedShell throwing Errors in logs after successfull completion

This is a logging bug only. (The manager is trying to stop already-finished containers.)