
Input file size: 75GB

Number of mappers: 2273

Number of reducers: 1 (as shown in the web UI)

Number of splits: 2273

Number of input files: 867

Cluster: Apache Hadoop 2.4.0

5-node cluster, 1TB storage each.

1 master and 4 DataNodes.

It's been 4 hours now and only 12% of the map phase is complete. Given my cluster configuration, does this make sense, or is something wrong with the setup?

yarn-site.xml:

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8040</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
        <description>The hostname of the RM.</description>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
        <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8192</value>
        <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
        <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>32</value>
        <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
        <description>Physical memory, in MB, to be made available to running containers.</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
        <description>Number of CPU cores that can be allocated for containers.</description>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>4</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
        <description>Whether virtual memory limits will be enforced for containers.</description>
    </property>

It's a MapReduce job using multiple outputs, so the reducer emits multiple files. Each machine has 15GB of RAM. 8 containers are running, and the RM web UI shows 32GB of total memory available (consistent with 4 NodeManagers × 8192MB).
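
For context, the reducer is wired roughly like this (a simplified sketch, not my actual code; the key-based output path is a placeholder naming scheme):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Simplified sketch: routes each record to a per-key output file
    // via MultipleOutputs instead of the single default output.
    public class MultiOutReducer extends Reducer<Text, Text, Text, Text> {

        private MultipleOutputs<Text, Text> mos;

        @Override
        protected void setup(Context context) {
            mos = new MultipleOutputs<>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                // The base output path here is a placeholder naming scheme.
                mos.write(key, value, key.toString());
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            mos.close();  // flush and close all named outputs
        }
    }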

Any guidance is appreciated. Thanks in advance.

Can you provide information about what type of job you are running? Also, what is the RAM available on each machine? Can you log in to the ResourceManager UI and check the total memory available to the cluster and the number of containers running in parallel? I suspect the job is underutilizing the resources. – Shivanand Pawar

@Shivanand Pawar: It's a MapReduce job where I am using multiple outputs, so I will have multiple files. Each machine has 15GB of RAM. 8 containers are running. Total memory available is 32GB. – Shash

1 Answer


A few points to check:

  1. The block/split size looks very small for this data: 2273 splits over 75GB averages only ~34MB per split, so much of the mapper time goes to task startup overhead. Try increasing both to an optimal level (see the driver sketch after this list).

  2. If you are not already doing so, use a custom partitioner that spreads your data uniformly across reducers.

  3. Consider using a combiner.

  4. Consider appropriate compression for the intermediate map output.

  5. Use an optimal block replication factor.

  6. Increase the number of reducers as appropriate; a single reducer is a bottleneck for 75GB of input.

These should help improve performance. Give them a try and share your findings!
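
For illustration, here is a hedged driver sketch showing where points 1 to 6 plug in. The class names, the 256MB split size, the replication factor, and the reducer count are assumptions for illustration, not tested values, and the partitioner assumes Text map-output keys:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class TunedDriver {

        // (2) Illustrative partitioner: spreads keys uniformly across reducers.
        public static class UniformPartitioner extends Partitioner<Text, Text> {
            @Override
            public int getPartition(Text key, Text value, int numPartitions) {
                return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // (4) Compress intermediate map output to cut shuffle I/O
            //     (Snappy assumes the native libraries are installed on the nodes).
            conf.setBoolean("mapreduce.map.output.compress", true);
            conf.setClass("mapreduce.map.output.compress.codec",
                    SnappyCodec.class, CompressionCodec.class);

            // (5) Replication factor for files this job writes (illustrative).
            // conf.setInt("dfs.replication", 2);

            Job job = Job.getInstance(conf, "tuned job");
            job.setJarByClass(TunedDriver.class);
            // Mapper/reducer classes omitted; only the tuning knobs are shown.

            // (1) Raise the minimum split size so each mapper processes more
            //     data; 256MB is an illustrative value, not a tested one.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

            job.setPartitionerClass(UniformPartitioner.class);  // (2)
            // (3) job.setCombinerClass(MyReducer.class); // hypothetical; only if reduce is associative
            job.setNumReduceTasks(8);  // (6) illustrative: roughly one per running container

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Fewer, larger splits cut the per-task JVM startup cost, and map-side compression shrinks what the single shuffle has to move; the two together usually matter most at this input size.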

Edit 1: Compare the log of a successful map task with that of a long-running map task attempt; 12% means roughly 272 of the 2273 map tasks have completed. That should show where it is getting stuck (yarn logs -applicationId <app id> collects the container logs).

Edit 2: Tweak these parameters: yarn.scheduler.minimum-allocation-mb, yarn.scheduler.maximum-allocation-mb, yarn.nodemanager.resource.memory-mb, mapreduce.map.memory.mb, mapreduce.map.java.opts, mapreduce.reduce.memory.mb, mapreduce.reduce.java.opts, mapreduce.task.io.sort.mb, mapreduce.task.io.sort.factor
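
As a rough illustration of how the per-job parameters relate: the values below assume the 8192MB NodeManager memory and 1024MB minimum allocation from your yarn-site.xml, and follow the common rule of thumb of setting the JVM heap to about 80% of the container size. None of these are tested on your cluster; treat them as starting points. The yarn.scheduler.* and yarn.nodemanager.* settings belong in yarn-site.xml, not per job:

    import org.apache.hadoop.conf.Configuration;

    public class MemorySettings {
        // Illustrative per-job values (mapred-site.xml equivalents).
        public static Configuration tuned() {
            Configuration conf = new Configuration();
            conf.setInt("mapreduce.map.memory.mb", 2048);         // map container size
            conf.set("mapreduce.map.java.opts", "-Xmx1638m");     // heap ~80% of container
            conf.setInt("mapreduce.reduce.memory.mb", 4096);      // reduce container size
            conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");  // heap ~80% of container
            conf.setInt("mapreduce.task.io.sort.mb", 512);        // sort buffer; must fit in map heap
            conf.setInt("mapreduce.task.io.sort.factor", 50);     // streams merged per merge pass
            return conf;
        }
    }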

These should improve the situation; take a trial-and-error approach.

Also refer: Container is running beyond memory limits

Edit 3: Try to isolate a part of the logic, convert it to a Pig script, run it, and see how it behaves.