0
votes

I use CDH 5.1.0 (Hadoop 2.3.0) with 2 name nodes (2x 32 GB RAM, 2 cores) and 3 data nodes (3x 16 GB RAM, 2 cores).

I am scheduling MapReduce jobs from a single user in the default queue (there are no other users, and no other queues are configured).

When using the Capacity Scheduler, the following happens: I can submit multiple jobs, but only 2 of them are executed (status 'running') in parallel.

When using the Fair Scheduler, the following happens: I submit multiple jobs and 4 of them are set to status 'running' by the cluster/scheduler. These jobs remain at 5% progress forever. If one of the jobs is killed, another job is set to status 'running' at 5%, again with no further progress. Jobs only start to execute once fewer than 4 jobs are running and no further jobs are submitted to the queue.

I have reconfigured the cluster multiple times, but was never able to increase the number of running jobs with the Capacity Scheduler, or to avoid the hang-up of jobs with the Fair Scheduler.

My question is: how do I configure the cluster/YARN/scheduler/dynamic and static resource pools so that scheduling works?

Here are some of the config parameters:

yarn.scheduler.minimum-allocation-mb = 2GB
yarn.scheduler.maximum-allocation-mb = 12GB
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 2
yarn.nodemanager.resource.memory-mb = 12GB
yarn.nodemanager.resource.cpu-vcores  = 2
mapreduce.map.memory.mb = 12GB
mapreduce.reduce.memory.mb = 12GB
mapreduce.map.java.opts.max.heap = 9.6GB
mapreduce.reduce.java.opts.max.heap = 9.6GB
yarn.app.mapreduce.am.resource.mb = 12GB
ApplicationMaster Java Maximum Heap Size = 788MB
mapreduce.task.io.sort.mb = 1GB

I have left the Static and Dynamic Resource Pools at the default (Cloudera) settings (e.g. the Max Running Apps setting is empty).
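
For reference, here is roughly how the YARN side of the settings above would look in the generated yarn-site.xml (values converted to MB; this is only an illustration of my configuration, the actual file Cloudera Manager writes may differ slightly):

<!-- yarn-site.xml (illustrative rendering of the values listed above) -->
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>12288</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
</property>
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
</property>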

2
Were you able to find an answer for this? I have a similar issue: when I use the Capacity Scheduler, only one app starts and the others are put in the pending state. – sparkDabbler
Unfortunately not. In the meantime I dumped CDH, switched to the MapR distribution, and then dumped that too in favor of Apache Spark on Mesos, which made this issue obsolete for me. – Reinis

2 Answers

0
votes

NOT A SOLUTION, BUT A POSSIBLE WORKAROUND

At some point we discussed this issue with Christian Neundorf from MapR consulting, and he claimed that there is a deadlock bug in the FairScheduler (not CDH-specific, but in standard Hadoop!).

He suggested this solution, but I cannot recall whether we tried it. Use it at your own risk; I give no guarantee that it actually works and am posting it only for those of you who are really desperate and willing to try anything to get your app to work:

In yarn-site.xml (I don't know why this has to be set):

<property>
    <name>yarn.scheduler.fair.user-as-default-queue</name>
    <value>false</value>
    <description>Disable username for default queue </description>
</property>

In fair-scheduler.xml:

<allocations>
    <queue name="default">
        <!-- set an integer value here: the number of cores at your disposal minus one (or more) -->
        <maxRunningApps>number of cores - 1</maxRunningApps>
    </queue>
</allocations>
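
If I understand the suggestion correctly, on a cluster like the one in the question (3 data nodes with 2 cores each, i.e. 6 cores) that would come out to something like:

<allocations>
    <queue name="default">
        <!-- 3 data nodes x 2 cores = 6 cores, minus one -->
        <maxRunningApps>5</maxRunningApps>
    </queue>
</allocations>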
0
votes

Decrease these parameters:

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
yarn.app.mapreduce.am.resource.mb

to 6 GB (and decrease the heap sizes accordingly).
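
In raw mapred-site.xml terms that would look roughly like the following (illustrative values only; the ~4.9 GB heaps simply follow the common rule of thumb of about 80% of the container size):

<!-- mapred-site.xml (sketch of the reduced container sizes) -->
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>6144</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>6144</value>
</property>
<property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>6144</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx4915m</value>
</property>
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx4915m</value>
</property>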

With the current configuration you are able to run only three containers (one per node).

A YARN job needs at least two containers to run (one container for the ApplicationMaster and another for a Map or Reduce task). So you can easily hit a situation where you launch three ApplicationMasters for three different jobs, and they hang there forever because there are no containers left to perform the actual Map/Reduce processing.
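
To spell out the arithmetic (assuming the NodeManagers run on the three data nodes):

current:   12 GB per node / 12 GB per container = 1 container per node -> 3 containers total
           3 jobs submitted -> 3 ApplicationMasters -> no containers left for Map/Reduce tasks
with 6 GB: 12 GB per node /  6 GB per container = 2 containers per node -> 6 containers total
           -> room for both ApplicationMasters and task containers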

Furthermore, you should limit the number of applications that can run in parallel on your cluster (because you don't have that many resources) to 2 or 3.
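
One way to do that (a sketch, not something verified on your exact setup) is the Fair Scheduler's maxRunningApps shown in the other answer, or, if you stay with the Capacity Scheduler, capping the share of resources that ApplicationMasters may occupy in capacity-scheduler.xml:

<!-- capacity-scheduler.xml: limits how much of the cluster the ApplicationMasters
     can consume, which indirectly caps the number of concurrently running apps;
     0.5 is an illustrative value (the default is 0.1) -->
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.5</value>
</property>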