1 vote

Thank you in advance for any help. I am running a YARN job using a bundled example. The job never completes and stays in the "ACCEPTED" state. Judging by the output, the job is waiting to be scheduled while the client continuously polls for its status.

Example job (SparkPi from Spark 1.6.1, built for Hadoop 2.6.0):

spark-submit --master yarn-client --driver-memory 4g --executor-memory 2g --executor-cores 4  --class org.apache.spark.examples.SparkPi /home/john/spark/spark-1.6.1-bin-hadoop2.6/lib/spark-examples-1.6.1-hadoop2.6.0.jar 100

Output:

....
....
 disabled; ui acls disabled; users with view permissions: Set(john); users with modify permissions: Set(john)
16/07/27 17:36:09 INFO yarn.Client: Submitting application 1 to ResourceManager
16/07/27 17:36:09 INFO impl.YarnClientImpl: Submitted application application_1469665943738_0001
16/07/27 17:36:10 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:10 INFO yarn.Client:
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1469666169333
         final status: UNDEFINED
         tracking URL: http://cpt-bdx021:8088/proxy/application_1469665943738_0001/
         user: john
16/07/27 17:36:11 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:12 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:13 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:14 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:15 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:16 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:17 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:18 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:19 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:20 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:21 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
16/07/27 17:36:22 INFO yarn.Client: Application report for application_1469665943738_0001 (state: ACCEPTED)
...........
...........
...........

UPDATE: It looks like the job was submitted to the ResourceManager (hence "ACCEPTED"), but the ResourceManager "sees" no nodes or Hadoop workers to actually hand the job to:

$ jps
12404 Jps
12211 NameNode
12315 DataNode
11743 ApplicationHistoryServer
11876 ResourceManager
11542 NodeManager

$ yarn node -list
16/07/27 23:07:53 INFO client.RMProxy: Connecting to ResourceManager at /192.168.0.5.55:8032
Total Nodes:0
         Node-Id             Node-State Node-Http-Address       Number-of-Running-Containers
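Total Nodes:0 means no NodeManager is currently registered (and healthy) from the ResourceManager's point of view. Two quick cross-checks, sketched below — the conf-dir path is an assumption for a default single-node install:

```shell
# List every node the RM knows about, including LOST/UNHEALTHY ones;
# a node that registered and was later expired won't show in the default list
yarn node -list -all

# Verify the NodeManager points at the same RM address the RM actually binds to
grep -A1 "yarn.resourcemanager.hostname" $HADOOP_CONF_DIR/yarn-site.xml
```

If `-list -all` shows the node as UNHEALTHY or LOST, the problem is node health or heartbeats rather than scheduling.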

UPDATE(2): I am using the default etc/container-executor.cfg file:

yarn.nodemanager.linux-container-executor.group=#configured value of yarn.nodemanager.linux-container-executor.group
banned.users=#comma separated list of users who can not run applications
min.user.id=1000#Prevent other super-users
allowed.system.users=##comma separated list of system users who CAN run applications
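For what it's worth, container-executor.cfg only takes effect when the NodeManager is configured to use LinuxContainerExecutor; with the default DefaultContainerExecutor it is ignored, so it is unlikely to be the culprit here. A quick check (the conf-dir path is an assumption):

```shell
# Shows the configured executor class, if any; when the property is absent,
# YARN falls back to DefaultContainerExecutor
grep -B1 -A2 "yarn.nodemanager.container-executor.class" $HADOOP_CONF_DIR/yarn-site.xml
```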

Also, as an aside, I want to mention that I do not have a hadoop user or hadoop user group. I am using the default account I logged on to the system with, if that matters. Thanks!


UPDATE(3): NodeManager log

org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at 192.168.0.5.55:8031
2016-07-28 00:23:26,083 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2016-07-28 00:23:26,087 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
2016-07-28 00:23:26,233 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Rolling master-key for container-tokens, got key with id -160570002
2016-07-28 00:23:26,236 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: Rolling master-key for container-tokens, got key with id -1876215653
2016-07-28 00:23:26,237 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as 192.168.0.5.55:53034 with total resource of <memory:8192, vCores:8>
2016-07-28 00:23:26,237 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests
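Note that this log actually shows a successful registration (total resource <memory:8192, vCores:8>), which conflicts with yarn node -list reporting 0 nodes. One explanation is that the NodeManager registered with a different RM instance than the one the client queries, or that the node was later expired. The RM side of the story would confirm it — a sketch, assuming default log locations under $HADOOP_HOME:

```shell
# Did the RM ever see this node register, get rejected, or expire?
grep -iE "registered|rejected|expired|unhealthy" \
  $HADOOP_HOME/logs/yarn-*-resourcemanager-*.log | tail -20
```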
Are you running on a sandbox? Open the YARN ResourceManager UI and check the job status and the number of containers available for running. You might need to increase the Java memory. - yoga
I think I narrowed the problem down to the ResourceManager not being able to find any nodes. yarn node -list shows 0 total nodes. I updated the question above. - nikk
@yoga, the number of containers shows 0. - nikk
Are you using the HDP sandbox? Did you try running in local mode? - yoga
I am using the Hadoop binary directly from the Hadoop site. I am running all processes locally, on one physical machine. - nikk

2 Answers

0 votes

The reason your job never completes is that it never transitions from ACCEPTED to RUNNING. A scheduler decides which applications get resources and thereby move to the RUNNING state.

There are two schedulers available: the fair scheduler and the capacity scheduler. You can find details in the Hadoop YARN documentation. If you could provide your yarn-site.xml, capacity-scheduler.xml and fair-scheduler.xml files, I could help you better :).
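In the meantime, a quick way to see which scheduler is active and how full each queue is, is the ResourceManager's REST API — hostname and port 8088 below are the defaults and may differ in your setup:

```shell
# Dump the active scheduler and per-queue capacity/usage as JSON
curl -s http://localhost:8088/ws/v1/cluster/scheduler

# Cluster-wide metrics: total/available memory and vcores, active node count
curl -s http://localhost:8088/ws/v1/cluster/metrics
```

If `metrics` reports zero active nodes, no scheduler configuration will get the application past ACCEPTED.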

0 votes

The most common possibility is that the queue you are submitting your job to does not have the resources you are requesting available.

Typical problems may be:

  • Resource requirements (memory and/or cores). You're asking for more memory/cores than the queue is able to allocate. This may be because the cluster is nearly full, or because your settings are inconsistent.

  • Disk space. Check disk space on the nodes; there is a health check that may stop the node from accepting applications once local-dir utilization exceeds this threshold (90% by default):

    yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
    
  • In a multi-tenant / multi-queue environment with hard per-queue resource limits, your application may be hitting those limits. You may want to raise them, or test in another queue with more resources.
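The disk-space bullet above is easy to check directly: the NodeManager marks a local dir bad once it crosses the 90% default threshold, and with all dirs bad the whole node turns unhealthy and drops out of scheduling. A minimal sketch, assuming the default local/log dirs live under /tmp (adjust the path if you have overridden yarn.nodemanager.local-dirs or log-dirs):

```shell
# Current utilization of the filesystem holding the NodeManager's local dirs;
# -P forces POSIX one-line-per-filesystem output so the awk columns are stable
df -hP /tmp | awk 'NR==2 {printf "%s used on %s\n", $5, $6}'
```

Anything at or above 90% here would explain a node flipping to UNHEALTHY even though its daemons are running.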