
I have a Hadoop (version 2.5.0) cluster with 3 machines.

Topology:

  • 10.0.0.1: NameNode, DataNode
  • 10.0.0.2: DataNode
  • 10.0.0.3: DataNode

Configured as below:

core-site.xml

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://10.0.0.1/</value>
                <final>true</final>
        </property>
</configuration>

hdfs-site.xml

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>2</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:///home/tuannd/hdfs/namenode</value>
                <final>true</final>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:///home/tuannd/hdfs/datanode</value>
                <final>true</final>
        </property>
        <property>
                <name>dfs.permissions</name>
                <value>false</value>
        </property>
</configuration>

mapred-site.xml

<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>mapreduce.jobtracker.address</name>
                <value>10.0.0.1:9001</value>
                <final>true</final>
        </property>
        <property>
                <name>mapreduce.cluster.local.dir</name>
                <value>/tmp/hadoop/mapreduce/system</value>
                <final>true</final>
        </property>
        <property>
                <name>mapreduce.tasktracker.map.tasks.maximum</name>
                <value>7</value>
                <final>true</final>
        </property>
        <property>
                <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
                <value>7</value>
                <final>true</final>
        </property>
        <property>
                <name>mapreduce.job.maps</name>
                <value>100</value>
        </property>
        <property>
                <name>mapreduce.task.timeout</name>
                <value>0</value>
        </property>
        <property>
                <name>mapreduce.map.java.opts</name>
                <value>-Xmx512M</value>
        </property>
        <property>
                <name>mapreduce.reduce.java.opts</name>
                <value>-Xmx1024M</value>
        </property>
</configuration>

yarn-site.xml

<configuration>
<property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
</property>
<property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
</property>
</configuration>

slaves

10.0.0.1
10.0.0.2
10.0.0.3

After running start-all.sh, jps shows on the master:

19817 Jps
15240 ResourceManager
12521 SecondaryNameNode
12330 DataNode
12171 NameNode
15381 NodeManager

On the slaves:

24454 NodeManager
22828 DataNode
24584 Jps

WordCount code: the same as in this link.

With the same input data.

  • On Eclipse (master machine): Processing in 9s.
  • On Hadoop cluster: Processing in 30s.

I don't know what is wrong in my Hadoop cluster configuration files. Why is processing on the Hadoop cluster slower than in Eclipse?

Thanks.

Please specify what your question is. - Melon
Thanks for your comment, I have edited. I don't know what is wrong in my Hadoop cluster configuration files. Processing on the Hadoop cluster is slower than in Eclipse! - user3671651
Try with some GigaBytes of input data... Hadoop is bad at scaling down. - vefthym
@user3671651 did my answer not solve your problem? - vefthym

1 Answer


Hadoop is bad at scaling down to small data. Since the job finishes in 9 seconds, I assume you have a small input. Try running your program with a few GBs of input data and you will see a great difference.

Consider the cost of initializing the tasks (JVM startup, scheduling) and the network communication cost between your nodes, both of which are absent in the local version.

Tip: You can also set SumReducer as a Combiner and you will see a good speed boost on large inputs.
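In the driver this is a one-liner, `job.setCombinerClass(SumReducer.class);`. The reason it helps is that the combiner pre-aggregates map output locally before the shuffle, so far fewer records cross the network. A minimal local sketch of that pre-aggregation (class and variable names are illustrative, not from the linked WordCount code):

```java
import java.util.HashMap;
import java.util.Map;

public class CombinerSketch {
    // Simulate what a combiner does to one map task's raw output:
    // collapse repeated (word, 1) records into local partial sums.
    static Map<String, Integer> combine(String[] mapOutput) {
        Map<String, Integer> partialSums = new HashMap<>();
        for (String word : mapOutput) {
            partialSums.merge(word, 1, Integer::sum); // local pre-aggregation
        }
        return partialSums;
    }

    public static void main(String[] args) {
        // One (word, 1) record per occurrence, as the mapper emits them.
        String[] mapOutput = {"a", "b", "a", "a", "c", "b"};
        Map<String, Integer> combined = combine(mapOutput);
        // 6 raw records shrink to 3 partial sums shipped over the network.
        System.out.println(mapOutput.length + " -> " + combined.size());
    }
}
```

On a small input this saves nothing noticeable, which is exactly the point of the answer: the savings only show up when the shuffled data is large.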

UPDATE: If you are using exactly the code that you link to, then the problem is that you are using a single reducer (the default). You will see the benefits of parallelization if you use more reduce tasks (job.setNumReduceTasks(num);), where num can be chosen according to the directions provided here (these are just guidelines, not rules).
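To see why more reduce tasks spread the work out, recall that Hadoop's default HashPartitioner assigns each key to a reduce task with `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. A small sketch of that formula (class name is illustrative):

```java
import java.util.Arrays;

public class PartitionSketch {
    // The formula used by Hadoop's default HashPartitioner to pick
    // which reduce task receives a given key.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] keys = {"apple", "banana", "cherry", "date"};

        // With one reducer (the default), every key lands in partition 0,
        // so all reduce work runs serially on a single node.
        for (String k : keys) {
            System.out.println(k + " -> " + partition(k, 1));
        }

        // With 3 reducers the keys spread across partitions 0..2,
        // so the reduce phase can run in parallel on the cluster.
        int[] spread = new int[3];
        for (String k : keys) {
            spread[partition(k, 3)]++;
        }
        System.out.println(Arrays.toString(spread));
    }
}
```

With `job.setNumReduceTasks(1)` on a 3-node cluster, two of your nodes sit idle during the reduce phase, which is part of why the cluster run looks slow compared to the local one.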