Currently I'm working on loading data into a Titan graph with Hadoop (Titan version 0.5.4, Hadoop version 2.6.0). I'm using a single-server (pseudo-distributed) Hadoop cluster, with the intention of later extending to a full cluster of machines with the same hardware. I'm trying to set up Hadoop in such a way that I get full core utilization. Until now, I thought I had a decent setup with good configuration parameters, but when Hadoop executes and loads the data into the Titan graph, I don't see full utilization of all cores on my machine.
The situation is as follows. The machine I'm using has the following hardware specifications:
- CPU: 32 cores
- RAM: 256GB
- Swap memory: 32GB
- Drives: 8x128GB SSD, 4x2TB HDD
The data I'm loading into a Titan graph with Hadoop has the following specifications:
- Total size: 848MB
- Split into four files (487MB, 142MB, 219MB and 1.6MB), each containing vertices of one single type, together with all the vertex properties and outgoing edges.
While setting up the Hadoop cluster, I tried to use some logical reasoning to set the Hadoop configuration parameters to what I think are their optimal values. See this reasoning below.
- My machine has 32 cores, so in theory I could split my input into chunks sized so that I end up with around 32 of them. For 848MB of input, I could set dfs.block.size to 32MB, which would lead to around (848MB / 32MB ≈) 27 chunks.
- In order to ensure that each map task receives one chunk, I set the value of mapred.min.split.size to a bit less than the block size, and mapred.max.split.size to a bit more than the block size (for example 30MB and 34MB, respectively).
- The memory needed per task is a bit vague to me. For example, I could set mapred.child.java.opts to a value of -Xmx1024m to give each task (e.g. each mapper/reducer) 1GB of memory. Given that my machine has 256GB of memory in total, and subtracting some of it for other purposes leaves me around 200GB, I could end up with a total of (200GB / 1GB =) 200 mappers and reducers. Or, when I give each task 2GB of memory, I would end up with a total of 100 mappers and reducers. The amount of memory given to each task also depends on the input size, I guess. Anyway, this leads to values for mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum of around 100, which might already be too much given the fact that I have only 32 cores. Therefore, maybe setting this parameter to 32 for both map and reduce might be better? What do you think? (A back-of-the-envelope version of this calculation is sketched below.)
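To make the numbers above concrete, here is the rough calculation I'm basing this on, written out as a small shell sketch. The numbers are just the ones from my setup, and the split-size formula in the comment is how I understand FileInputFormat computes splits; treat it as illustrative rather than authoritative.
# Rough sizing calculation for my setup (illustrative only).
# As far as I understand, FileInputFormat computes the split size as
#   splitSize = max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
# so with min=30MB, max=34MB and block=32MB the effective split size stays at 32MB.
INPUT_MB=848          # total input size
SPLIT_MB=32           # effective split size
echo "approx. map tasks: $(( (INPUT_MB + SPLIT_MB - 1) / SPLIT_MB ))"   # ~27
RESERVED_GB=200       # memory I'm willing to hand to MapReduce tasks
HEAP_GB=2             # -Xmx2048m per task
echo "tasks that fit in memory: $(( RESERVED_GB / HEAP_GB ))"           # 100
echo "tasks that fit on the CPU: 32"                                    # the cores are the real ceiling here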
After these assumptions, I end up with the following configuration.
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.block.size</name>
<value>33554432</value>
<description>Specifies the size of data blocks in which the input dataset is split.</description>
</property>
</configuration>
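To check whether the four input files are actually stored in 32MB blocks (and would therefore produce roughly the number of splits I'm aiming for), the block layout can be inspected with fsck. The HDFS path below is just a placeholder for wherever the input files live:
# Show, per input file, its size and the blocks it occupies, plus a block summary
# (replace /user/hadoop/titan-input with the actual input directory)
hdfs fsck /user/hadoop/titan-input -files -blocks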
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2048m</value>
<description>Java opts for the task tracker child processes.</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>32</value>
<description>The maximum number of map tasks that will be run simultaneously by a tasktracker.</description>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>32</value>
<description>The maximum number of reduce tasks that will be run simultaneously by a tasktracker.</description>
</property>
<property>
<name>mapred.min.split.size</name>
<value>31457280</value>
<description>The minimum size chunk that map input should be split into.</description>
</property>
<property>
<name>mapred.max.split.size</name>
<value>35651584</value>
<description>The maximum size chunk that map input should be split into.</description>
</property>
<property>
<name>mapreduce.job.reduces</name>
<value>32</value>
<description>The default number of reducers to use.</description>
</property>
<property>
<name>mapreduce.job.maps</name>
<value>32</value>
<description>The default number of maps to use.</description>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
<description>The minimum allocation for every container request at the RM, in MBs.</description>
</property>
</configuration>
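One thing I'm not sure about is what the single NodeManager actually advertises in terms of memory and vcores, since that (together with yarn.scheduler.minimum-allocation-mb) caps how many containers can run concurrently. Something like the following should show it; the node id has to be copied from the output of the first command:
# List the NodeManager(s) together with the number of containers currently running on each
yarn node -list
# Show the advertised memory/CPU capacity and current usage of a specific node
yarn node -status <node-id-from-the-list-above>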
Executing Hadoop with these settings does not give me full core utilization on my single machine. Not all cores are busy throughout all MapReduce phases. During the Hadoop execution, I also took a look at the IO throughput using the iostat command (iostat -d -x 5 3, giving me three reports at 5-second intervals). A sample of such a report is shown below.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.07 0.02 0.41 0.29 2.37 12.55 0.01 16.92 5.18 17.43 2.47 0.10
sdb 0.07 2.86 4.90 10.17 585.19 1375.03 260.18 0.04 2.96 23.45 8.55 1.76 2.65
sdc 0.08 2.83 4.89 10.12 585.48 1374.71 261.17 0.07 4.89 30.35 8.12 2.08 3.13
sdd 0.07 2.83 4.89 10.10 584.79 1374.46 261.34 0.04 2.78 26.83 6.71 1.94 2.91
sde 0.00 0.00 0.00 0.00 0.05 0.80 278.61 0.00 10.74 2.55 32.93 0.73 0.00
sdf 0.00 0.00 0.00 0.00 0.05 0.80 283.72 0.00 10.30 1.94 33.09 0.68 0.00
sdg 0.00 0.00 0.00 0.00 0.05 0.80 283.83 0.00 10.24 1.99 32.75 0.68 0.00
sdh 0.00 0.00 0.00 0.00 0.05 0.80 284.13 0.00 10.29 1.96 32.99 0.69 0.00
sdi 0.00 0.00 0.00 0.00 0.05 0.80 284.87 0.00 17.89 2.35 60.33 0.74 0.00
sdj 0.00 0.00 0.00 0.00 0.05 0.80 284.05 0.00 10.30 2.01 32.96 0.68 0.00
sdk 0.00 0.00 0.00 0.00 0.05 0.80 284.44 0.00 10.20 1.99 32.62 0.68 0.00
sdl 0.00 0.00 0.00 0.00 0.05 0.80 284.21 0.00 10.50 2.00 33.71 0.69 0.00
md127 0.00 0.00 0.04 0.01 0.36 6.38 279.84 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 14.92 36.53 1755.46 4124.20 228.57 0.00 0.00 0.00 0.00 0.00 0.00
I'm no expert in disk utilization, but could these values mean that I'm IO-bound somewhere, for example on disks sdb, sdc or sdd?
Edit: maybe a better indication of CPU utilization and IO throughput can be given by using the sar command. Here are the results of 5 reports, 5 seconds apart (sar -u 5 5):
11:07:45 AM CPU %user %nice %system %iowait %steal %idle
11:07:50 AM all 12.77 0.01 0.91 0.31 0.00 86.00
11:07:55 AM all 15.99 0.00 1.39 0.56 0.00 82.05
11:08:00 AM all 11.43 0.00 0.58 0.04 0.00 87.95
11:08:05 AM all 8.03 0.00 0.69 0.48 0.00 90.80
11:08:10 AM all 8.58 0.00 0.59 0.03 0.00 90.80
Average: all 11.36 0.00 0.83 0.28 0.00 87.53
Thanks in advance for any reply!