I have configured a 3-node cluster to run the WordCount MapReduce program. As test data I am using a book from Project Gutenberg (http://www.gutenberg.org/ebooks/20417), which is about 659 KB. Interestingly, the web UI for that job shows only 1 map task, 1 reduce task, and 1 node involved. I am wondering whether this is because the data size is too small. If so, can I manually configure the job so that the data is split across several map tasks running on multiple nodes?
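For example, is something like the rough sketch below the right idea? It assumes the org.apache.hadoop.mapreduce API, and WordCountMapper / WordCountReducer are just placeholders for my actual classes; I am not sure whether setMaxInputSplitSize is the correct knob here, or whether the reducer count also needs to be raised.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);    // placeholder for my mapper class
        job.setReducerClass(WordCountReducer.class);  // placeholder for my reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Cap each input split at 128 KB so the ~659 KB book would be
        // divided into several splits, and thus several map tasks(?)
        FileInputFormat.setMaxInputSplitSize(job, 128 * 1024);

        // Would this also be needed to get more than one reduce task?
        // job.setNumReduceTasks(3);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

If that is not the recommended way, please point me to the right configuration property.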
Thanks, Allen