0
votes

I use CDH 5.1.0 (Hadoop 2.3.0) with 2 name nodes (2x 32 GB RAM, 2 cores) and 3 data nodes (3x 16 GB RAM, 2 cores).

I am scheduling MapReduce jobs from a single user in the default queue (there are no other users, and no other queues are configured).

When using the Capacity Scheduler, the following happens: I can submit multiple jobs, but only 2 of them are executed (status 'running') in parallel.

When using the Fair Scheduler, the following happens: I submit multiple jobs and 4 of them are set to status 'running' by the cluster/scheduler. These jobs remain at 5% progress forever. If one of the jobs is killed, another job is set to status 'running' at 5%, again with no further progress. Jobs only start to execute once fewer than 4 jobs are running and no further jobs are submitted to the queue.

I have reconfigured the cluster multiple times, but was never able to increase the number of running jobs with the Capacity Scheduler, or to avoid the hang-up of jobs with the Fair Scheduler.

My question is: how do I configure the cluster/YARN/scheduler/dynamic and static resource pools so that scheduling works?

Here are some of the config parameters:

yarn.scheduler.minimum-allocation-mb = 2GB
yarn.scheduler.maximum-allocation-mb = 12GB
yarn.scheduler.minimum-allocation-vcores = 1
yarn.scheduler.maximum-allocation-vcores = 2
yarn.nodemanager.resource.memory-mb = 12GB
yarn.nodemanager.resource.cpu-vcores  = 2
mapreduce.map.memory.mb = 12GB
mapreduce.reduce.memory.mb = 12GB
mapreduce.map.java.opts.max.heap = 9.6GB
mapreduce.reduce.java.opts.max.heap = 9.6GB
yarn.app.mapreduce.am.resource.mb = 12GB
ApplicationMaster Java Maximum Heap Size = 788MB
mapreduce.task.io.sort.mb = 1GB

I have left the Static and Dynamic Resource Pools at the default (Cloudera) settings (e.g. the Max Running Apps setting is empty).
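
For reference, here is roughly how the YARN side of the settings above would look in the generated yarn-site.xml (values converted to MB; this is only an illustration of my configuration, the actual file Cloudera Manager writes may differ slightly):

<!-- yarn-site.xml (illustrative rendering of the values listed above) -->
<property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>2048</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>12288</value>
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>2</value>
</property>
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>12288</value>
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>2</value>
</property>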

2
Were you able to find an answer for this? I have a similar issue: when I use the Capacity Scheduler, only one app starts and the others are put in the pending state. – sparkDabbler
Unfortunately not. In the meantime I dumped CDH, switched to the MapR distribution, and then dumped that too in favor of Apache Spark on Mesos, which made this issue obsolete for me. – Reinis

2 Answers

0
votes

NOT A SOLUTION, BUT A POSSIBLE WORKAROUND

At some point we discussed this issue with Christian Neundorf from MapR consulting, and he claimed that there is a deadlock bug in the FairScheduler (not CDH-specific, but in standard Hadoop!).

He suggested this solution, but I cannot recall whether we tried it. Use it at your own risk; I give no guarantee that it actually works and am posting it only for those of you who are really desperate and willing to try anything to get your app to work:

In yarn-site.xml (I don't know why this has to be set):

<property>
    <name>yarn.scheduler.fair.user-as-default-queue</name>
    <value>false</value>
    <description>Disable username for default queue </description>
</property>

In fair-scheduler.xml:

<allocations>
    <queue name="default">
        <!-- set an integer value here: the number of cores at your disposal minus one (or more) -->
        <maxRunningApps>number of cores - 1</maxRunningApps>
    </queue>
</allocations>
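
If I understand the suggestion correctly, on a cluster like the one in the question (3 data nodes with 2 cores each, i.e. 6 cores) that would come out to something like:

<allocations>
    <queue name="default">
        <!-- 3 data nodes x 2 cores = 6 cores, minus one -->
        <maxRunningApps>5</maxRunningApps>
    </queue>
</allocations>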
0
votes

Decrease these parameters:

mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
yarn.app.mapreduce.am.resource.mb

to 6 GB (and decrease the heap sizes accordingly).
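
In raw mapred-site.xml terms that would look roughly like the following (illustrative values only; the ~4.9 GB heaps simply follow the common rule of thumb of about 80% of the container size):

<!-- mapred-site.xml (sketch of the reduced container sizes) -->
<property>
    <name>mapreduce.map.memory.mb</name>
    <value>6144</value>
</property>
<property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>6144</value>
</property>
<property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>6144</value>
</property>
<property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx4915m</value>
</property>
<property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx4915m</value>
</property>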

With the current configuration you are able to run only three containers (one per node).

A YARN job needs at least two containers to run (one container for the ApplicationMaster and another for a Map or Reduce task). So you can easily hit a situation where you launch three ApplicationMasters for three different jobs, and they hang there forever because there are no containers left to perform the actual Map/Reduce processing.
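
To spell out the arithmetic (assuming the NodeManagers run on the three data nodes):

current:   12 GB per node / 12 GB per container = 1 container per node -> 3 containers total
           3 jobs submitted -> 3 ApplicationMasters -> no containers left for Map/Reduce tasks
with 6 GB: 12 GB per node /  6 GB per container = 2 containers per node -> 6 containers total
           -> room for both ApplicationMasters and task containers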

Furthermore, you should limit the number of applications that can run in parallel on your cluster (because you don't have that many resources) to 2 or 3.
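
One way to do that (a sketch, not something verified on your exact setup) is the Fair Scheduler's maxRunningApps shown in the other answer, or, if you stay with the Capacity Scheduler, capping the share of resources that ApplicationMasters may occupy in capacity-scheduler.xml:

<!-- capacity-scheduler.xml: limits how much of the cluster the ApplicationMasters
     can consume, which indirectly caps the number of concurrently running apps;
     0.5 is an illustrative value (the default is 0.1) -->
<property>
    <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
    <value>0.5</value>
</property>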