
I am running a small private cluster of Linux machines with Hadoop 2.6.2 and YARN. I launch yarn jobs from a Linux edge node. The canned YARN example that approximates the value of pi works perfectly when run by the hadoop user (the superuser and owner of the cluster), but fails when run from my personal account on the edge node. In both cases (hadoop, me) I run the job exactly like this:

clott@edge: /home/hadoop/hadoop-2.6.2/bin/yarn jar /home/hadoop/hadoop-2.6.2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.2.jar pi 2 5

It fails; the full output is below. I think the FileNotFoundException is a red herring: something causes the container launch itself to fail, so there is simply no job output to be found afterward. What causes container launches to fail, and how can this be debugged?
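One debugging avenue I'm aware of is pulling the aggregated container logs from the command line; it only helps if yarn.log-aggregation-enable is actually set to true in yarn-site.xml on the cluster (the application id comes from the output below):

    # Sketch: fetch the failed application's container logs.
    # Prints nothing useful unless log aggregation is enabled.
    /home/hadoop/hadoop-2.6.2/bin/yarn logs -applicationId application_1450815437271_0002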

Because this identical command works perfectly when run by the hadoop user but fails when run by a different account on the same edge node, I suspect a permission or other YARN configuration problem, not a missing jar. My personal account uses the same environment variables as the hadoop account, for what that's worth.

These questions are similar, but I didn't find a solution in any of them:

https://issues.cloudera.org/browse/DISTRO-577

Running a map reduce job as a different user

Yarn MapReduce Job Issue - AM Container launch error in Hadoop 2.3.0

I have tried these remedies without any success (a quick sanity check of both is sketched after the list):

  1. In core-site.xml, set the value of hadoop.tmp.dir to /tmp/temp-${user.name}

  2. Add my personal user account to every node in the cluster
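
A minimal sketch of how to sanity-check the two remedies; the worker hostnames are placeholders for my real nodes:

    # Remedy 1: hadoop.tmp.dir should expand per user. Run as my own
    # account, this should print /tmp/temp-clott.
    /home/hadoop/hadoop-2.6.2/bin/hdfs getconf -confKey hadoop.tmp.dir

    # Remedy 2: my account should resolve on every node in the cluster.
    for h in node1 node2 node3; do
        ssh "$h" id clott
    done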

I guess that many installations run with just a single user, but I'm trying to allow two people to work together on the cluster without trashing each other's intermediate results. Am I totally nuts?
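
For context, the multi-user arrangement I'm aiming for is the conventional one: each account gets its own HDFS home directory, created by the superuser along these lines (my username as the example):

    # Run as the hadoop superuser: give the personal account its own
    # HDFS home directory so intermediate results stay separate.
    hdfs dfs -mkdir -p /user/clott
    hdfs dfs -chown clott /user/clott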

Full output:

Number of Maps  = 2
Samples per Map = 5
Wrote input for Map #0
Wrote input for Map #1
Starting Job
15/12/22 15:29:18 INFO client.RMProxy: Connecting to ResourceManager at ac1.mycompany.com/1.2.3.4:8032
15/12/22 15:29:18 INFO input.FileInputFormat: Total input paths to process : 2
15/12/22 15:29:19 INFO mapreduce.JobSubmitter: number of splits:2
15/12/22 15:29:19 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1450815437271_0002
15/12/22 15:29:19 INFO impl.YarnClientImpl: Submitted application application_1450815437271_0002
15/12/22 15:29:19 INFO mapreduce.Job: The url to track the job: http://ac1.mycompany.com:8088/proxy/application_1450815437271_0002/
15/12/22 15:29:19 INFO mapreduce.Job: Running job: job_1450815437271_0002
15/12/22 15:29:31 INFO mapreduce.Job: Job job_1450815437271_0002 running in uber mode : false
15/12/22 15:29:31 INFO mapreduce.Job:  map 0% reduce 0%
15/12/22 15:29:31 INFO mapreduce.Job: Job job_1450815437271_0002 failed with state FAILED due to: Application application_1450815437271_0002 failed 2 times due to AM Container for appattempt_1450815437271_0002_000002 exited with  exitCode: 1
For more detailed output, check application tracking page:http://ac1.mycompany.com:8088/proxy/application_1450815437271_0002/Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1450815437271_0002_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
    at org.apache.hadoop.util.Shell.run(Shell.java:455)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
15/12/22 15:29:31 INFO mapreduce.Job: Counters: 0
Job Finished in 13.489 seconds
java.io.FileNotFoundException: File does not exist: hdfs://ac1.mycompany.com/user/clott/QuasiMonteCarlo_1450816156703_163431099/out/reduce-out
    at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1122)
    at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:1114)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1114)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1817)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1841)
    at org.apache.hadoop.examples.QuasiMonteCarlo.estimatePi(QuasiMonteCarlo.java:314)
    at org.apache.hadoop.examples.QuasiMonteCarlo.run(QuasiMonteCarlo.java:354)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.examples.QuasiMonteCarlo.main(QuasiMonteCarlo.java:363)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:71)
    at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
Comment from Manjunath Ballur: It seems like a permissions issue. Check the answer here for setting the permissions: stackoverflow.com/questions/34418647/…

1 Answer


Yes, Manjunath Ballur, you were right: it was a permissions problem! I finally learned how to preserve the YARN application logs, and they clearly revealed the problem. Here are the steps:

  1. Edit yarn-site.xml and add a property to delay yarn log deletion:

    <property>
        <name>yarn.nodemanager.delete.debug-delay-sec</name>
        <value>600</value>
    </property>
    
  2. Push yarn-site.xml to all nodes (ARGH, I forgot this for a long time) and restart the cluster (one way to do this is sketched after the list).

  3. Run the yarn pi example as shown above; it fails. Look at http://namenode:8088/cluster/apps/FAILED to see the failed application, click the link for the most recent failure, and look at the bottom to see which nodes in the cluster were used.

  4. Open a shell on one of the nodes where the app failed. Find the job directory, which in my case was

    ~hadoop/hadoop-2.6.2/logs/userlogs/application_1450815437271_0004/container_1450815437271_0004_01_000001/
    
  5. Et voilà, I saw the files stdout (nothing but log4j complaints), stderr (nearly empty), and syslog (winner winner chicken dinner). In the syslog file I found this gem:

    2015-12-23 08:31:42,376 INFO [main] org.apache.hadoop.service.AbstractService: Service JobHistoryEventHandler failed in state INITED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=clott, access=EXECUTE, inode="/tmp/hadoop-yarn/staging/history":hadoop:supergroup:drwxrwx---
    
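For step 2, a sketch of one way to push the config and bounce YARN; the hostnames are placeholders, and I'm assuming passwordless ssh as the hadoop user:

    # Copy the edited yarn-site.xml to every node, then restart YARN.
    for h in node1 node2 node3; do
        scp ~/hadoop-2.6.2/etc/hadoop/yarn-site.xml "$h":hadoop-2.6.2/etc/hadoop/
    done
    ~/hadoop-2.6.2/sbin/stop-yarn.sh
    ~/hadoop-2.6.2/sbin/start-yarn.sh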

So the problem was permissions on hdfs:///tmp/hadoop-yarn/staging/history. A simple chmod 777 put me right; I'm not fighting the group permissions anymore. Now a non-hadoop, non-superuser account can run a yarn job.
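
Roughly the fix, run as the hadoop superuser (add -R if subdirectories under history block you as well):

    # Open up the shared job-history staging directory to all users.
    hdfs dfs -chmod 777 /tmp/hadoop-yarn/staging/history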