13
votes


I just finished setting up a small Hadoop cluster (using 3 Ubuntu machines and Apache Hadoop 2.2.0) and am now trying to run Python streaming jobs.
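
For reference, the test job is launched with something along these lines (the script names, paths, and the streaming jar location are placeholders rather than my exact command):

# Placeholder invocation: mapper.py / reducer.py and the input/output
# paths are illustrative, not the exact job that was run.
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -files mapper.py,reducer.py \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -input /user/test/input \
    -output /user/test/output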

Running a test job I encounter the following problem:
Almost all map tasks are marked as successful, but with a note saying "Container killed".

On the web interface the log for the map tasks says:
Progress 100.00
State SUCCEEDED

but under Note it says, for almost every attempt (~200):
Container killed by the ApplicationMaster.
or
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143

In the log file associated with the attempt I can see a line saying Task 'attempt_xxxxxxxxx_0' done.

I also get 3 attempts with the same log, but those 3 have
State KILLED
and are listed under killed jobs.

stderr output is empty for all jobs/attempts.

When looking at the ApplicationMaster log and following one of the successful (but killed) attempts, I find the following log entries:

  • Transitioned from NEW to UNASSIGNED
  • Transitioned from UNASSIGNED to ASSIGNED
  • several progress updates, including: 1.0
  • Done acknowledgement
  • RUNNING to SUCCESS_CONTAINER_CLEANUP
  • CONTAINER_REMOTE_CLEANUP
  • KILLING attempt_xxxx
  • Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED
  • Task Transitioned from RUNNING to SUCCEEDED

All the attempts are numbered xxxx_0, so I assume they are not killed as a result of speculative execution.

Should I be worried about this? And what causes the containers to be killed? Any suggestions would be greatly appreciated!

This still seems to happen every now and then. The output seems fine, but I'm still wondering what is behind this! – GebitsGerbils
Question - I would ask this in the comments but I don't have the rep for that: how much memory are these Python scripts using? If they use too much, don't they get automatically killed? If I am correct, fixing the mapred.child.ulimit setting to unlimited or optimizing your Python script may help (see the sketch after these comments). -Jimmy – jimf
Were you able to solve this? I have a similar problem. – Noah Watkins
No, this still happens every now and then... – GebitsGerbils
Any solutions? I have a similar problem with Hadoop 2.6 on Mac OS X 10.8.3. I used Java code in my MapReduce program. – mary
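
If memory were the culprit, as the comment about memory above suggests, the per-task limits can be raised when submitting the job. A minimal sketch, assuming the standard Hadoop 2.x property names; the values, script name, and paths are only examples, not recommendations:

# Illustrative only: raise the container and JVM heap limits for the
# map tasks of a streaming job. 2048 MB / -Xmx1638m are example values.
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -D mapreduce.map.memory.mb=2048 \
    -D mapreduce.map.java.opts=-Xmx1638m \
    -files mapper.py \
    -mapper "python mapper.py" \
    -input /user/test/input \
    -output /user/test/output_highmem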

3 Answers

2
votes

Yes, I agree with @joshua. It seems to be a bug where a task/container does not die gracefully after successfully finishing its map/reduce work, so after the grace period the ApplicationMaster has to kill it instead.

I am running 'yarn version' = Hadoop 2.5.0-cdh5.3.0

I picked one of the tasks and grep'ed for its history in the log generated for my MR application:

$ yarn logs -applicationId application_1422894000163_0003 | grep attempt_1422894000163_0003_r_000008_0

You will see that "attempt_1422894000163_0003_r_000008_0" goes through "TaskAttempt Transitioned from NEW to UNASSIGNED ... to RUNNING to SUCCESS_CONTAINER_CLEANUP".

In the SUCCESS_CONTAINER_CLEANUP step, you will see messages about this container being killed. After the container is killed, the attempt moves to the "TaskAttempt Transitioned from SUCCESS_CONTAINER_CLEANUP to SUCCEEDED" step.

0
votes

As far as I know, the same task can be run on several nodes (speculative execution): as soon as one node returns the result, the task attempts on the other nodes are killed. That's why the job SUCCEEDED but individual tasks are in the KILLED state.
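
If that were the cause here, a quick way to check is to disable speculative execution for one run and see whether the KILLED attempts disappear. A rough sketch, using the standard Hadoop 2.x property names; the script and paths are placeholders:

# Illustrative only: run the same streaming job once with speculative
# execution disabled and check whether any attempts end up KILLED.
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar \
    -D mapreduce.map.speculative=false \
    -D mapreduce.reduce.speculative=false \
    -files mapper.py \
    -mapper "python mapper.py" \
    -input /user/test/input \
    -output /user/test/output_nospec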

0
votes

What version are you using? You may have encountered YARN-903: DistributedShell throwing Errors in logs after successfull completion

This is a logging bug only. (The manager is trying to stop already-finished containers.)