Currently I'm working on loading data into a Titan graph with Hadoop (Titan version 0.5.4, Hadoop version 2.6.0). I'm using a single-server (pseudo-distributed) Hadoop cluster, with the intention of later extending to a full cluster of machines with the same hardware. I'm trying to set up Hadoop in such a way that I get full core utilization. Until now, I thought I had a decent setup with good configuration parameters, but when Hadoop executes and loads the data into the Titan graph, I don't see full utilization of all cores on my machine.
The situation is as follows. The machine I'm using has the following hardware specifications:
- CPU: 32 cores
- RAM: 256GB
- Swap memory: 32GB
- Drives: 8x128GB SSD, 4x2TB HDD
The data I'm loading into a Titan graph with Hadoop has the following specifications:
- Total size: 848MB
- Split into four files (487MB, 142MB, 219MB and 1.6MB), each containing vertices of one single type, together with all the vertex properties and outgoing edges.
While setting up the Hadoop cluster, I tried to use some logical reasoning to set the Hadoop configuration parameters to what I think are their optimal values. See this reasoning below.
- My machine has 32 cores, so in theory I could split my input into chunks sized so that I end up with around 32 of them. For 848MB of input, I could set dfs.block.size to 32MB, which would lead to around (848MB / 32MB ≈) 27 chunks.
- In order to ensure that each map task receives one chunk, I set the value of mapred.min.split.size to a bit less than the block size, and mapred.max.split.size to a bit more than the block size (for example 30MB and 34MB, respectively).
- The memory needed per task is a bit vague to me. For example, I could set mapred.child.java.opts to a value of -Xmx1024m to give each task (e.g. each mapper/reducer) 1GB of memory. Given that my machine has 256GB of memory in total, and subtracting some of it for other purposes leaves me around 200GB, I could end up with a total of (200GB / 1GB =) 200 mappers and reducers. Or, when I give each task 2GB of memory, I would end up with a total of 100 mappers and reducers. The amount of memory given to each task also depends on the input size, I guess. Anyway, this leads to values for mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum of around 100, which might already be too much given the fact that I have only 32 cores. Therefore, maybe setting this parameter to 32 for both map and reduce might be better? What do you think? (A back-of-the-envelope version of this calculation is sketched below.)
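To make the numbers above concrete, here is the rough calculation I'm basing this on, written out as a small shell sketch. The numbers are just the ones from my setup, and the split-size formula in the comment is how I understand FileInputFormat computes splits; treat it as illustrative rather than authoritative.
# Rough sizing calculation for my setup (illustrative only).
# As far as I understand, FileInputFormat computes the split size as
#   splitSize = max(mapred.min.split.size, min(mapred.max.split.size, dfs.block.size))
# so with min=30MB, max=34MB and block=32MB the effective split size stays at 32MB.
INPUT_MB=848          # total input size
SPLIT_MB=32           # effective split size
echo "approx. map tasks: $(( (INPUT_MB + SPLIT_MB - 1) / SPLIT_MB ))"   # ~27
RESERVED_GB=200       # memory I'm willing to hand to MapReduce tasks
HEAP_GB=2             # -Xmx2048m per task
echo "tasks that fit in memory: $(( RESERVED_GB / HEAP_GB ))"           # 100
echo "tasks that fit on the CPU: 32"                                    # the cores are the real ceiling here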
After these assumptions, I end up with the following configuration.
hdfs-site.xml
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.block.size</name>
<value>33554432</value>
<description>Specifies the size of data blocks in which the input dataset is split.</description>
</property>
</configuration>
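To check whether the four input files are actually stored in 32MB blocks (and would therefore produce roughly the number of splits I'm aiming for), the block layout can be inspected with fsck. The HDFS path below is just a placeholder for wherever the input files live:
# Show, per input file, its size and the blocks it occupies, plus a block summary
# (replace /user/hadoop/titan-input with the actual input directory)
hdfs fsck /user/hadoop/titan-input -files -blocks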
mapred-site.xml
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>The runtime framework for executing MapReduce jobs. Can be one of local, classic or yarn.</description>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx2048m</value>
<description>Java opts for the task tracker child processes.</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>32</value>
<description>The maximum number of map tasks that will be run simultaneously by a tasktracker.</description>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>32</value>
<description>The maximum number of reduce tasks that will be run simultaneously by a tasktracker.</description>
</property>
<property>
<name>mapred.min.split.size</name>
<value>31457280</value>
<description>The minimum size chunk that map input should be split into.</description>
</property>
<property>
<name>mapred.max.split.size</name>
<value>35651584</value>
<description>The maximum size chunk that map input should be split into.</description>
</property>
<property>
<name>mapreduce.job.reduces</name>
<value>32</value>
<description>The default number of reducers to use.</description>
</property>
<property>
<name>mapreduce.job.maps</name>
<value>32</value>
<description>The default number of maps to use.</description>
</property>
</configuration>
yarn-site.xml
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>2048</value>
<description>The minimum allocation for every container request at the RM, in MBs.</description>
</property>
</configuration>
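One thing I'm not sure about is what the single NodeManager actually advertises in terms of memory and vcores, since that (together with yarn.scheduler.minimum-allocation-mb) caps how many containers can run concurrently. Something like the following should show it; the node id has to be copied from the output of the first command:
# List the NodeManager(s) together with the number of containers currently running on each
yarn node -list
# Show the advertised memory/CPU capacity and current usage of a specific node
yarn node -status <node-id-from-the-list-above>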
Executing Hadoop with these settings does not give me full core utilization on my single machine. Not all cores are busy throughout all MapReduce phases. During the Hadoop execution, I also took a look at the IO throughput using the iostat command (iostat -d -x 5 3, giving me three reports at 5-second intervals). A sample of such a report is shown below.
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 0.07 0.02 0.41 0.29 2.37 12.55 0.01 16.92 5.18 17.43 2.47 0.10
sdb 0.07 2.86 4.90 10.17 585.19 1375.03 260.18 0.04 2.96 23.45 8.55 1.76 2.65
sdc 0.08 2.83 4.89 10.12 585.48 1374.71 261.17 0.07 4.89 30.35 8.12 2.08 3.13
sdd 0.07 2.83 4.89 10.10 584.79 1374.46 261.34 0.04 2.78 26.83 6.71 1.94 2.91
sde 0.00 0.00 0.00 0.00 0.05 0.80 278.61 0.00 10.74 2.55 32.93 0.73 0.00
sdf 0.00 0.00 0.00 0.00 0.05 0.80 283.72 0.00 10.30 1.94 33.09 0.68 0.00
sdg 0.00 0.00 0.00 0.00 0.05 0.80 283.83 0.00 10.24 1.99 32.75 0.68 0.00
sdh 0.00 0.00 0.00 0.00 0.05 0.80 284.13 0.00 10.29 1.96 32.99 0.69 0.00
sdi 0.00 0.00 0.00 0.00 0.05 0.80 284.87 0.00 17.89 2.35 60.33 0.74 0.00
sdj 0.00 0.00 0.00 0.00 0.05 0.80 284.05 0.00 10.30 2.01 32.96 0.68 0.00
sdk 0.00 0.00 0.00 0.00 0.05 0.80 284.44 0.00 10.20 1.99 32.62 0.68 0.00
sdl 0.00 0.00 0.00 0.00 0.05 0.80 284.21 0.00 10.50 2.00 33.71 0.69 0.00
md127 0.00 0.00 0.04 0.01 0.36 6.38 279.84 0.00 0.00 0.00 0.00 0.00 0.00
md0 0.00 0.00 14.92 36.53 1755.46 4124.20 228.57 0.00 0.00 0.00 0.00 0.00 0.00
I'm no expert in disk utilization, but could these values mean that I'm IO-bound somewhere, for example on disks sdb, sdc or sdd?
Edit: maybe a better indication of CPU utilization and IO throughput can be given by using the sar command. Here are the results of 5 reports, 5 seconds apart (sar -u 5 5):
11:07:45 AM CPU %user %nice %system %iowait %steal %idle
11:07:50 AM all 12.77 0.01 0.91 0.31 0.00 86.00
11:07:55 AM all 15.99 0.00 1.39 0.56 0.00 82.05
11:08:00 AM all 11.43 0.00 0.58 0.04 0.00 87.95
11:08:05 AM all 8.03 0.00 0.69 0.48 0.00 90.80
11:08:10 AM all 8.58 0.00 0.59 0.03 0.00 90.80
Average: all 11.36 0.00 0.83 0.28 0.00 87.53
Thanks in advance for any reply!