0
votes

I'm testing a MapReduce program to see how the execution time changes as I change the number of mappers.

Hadoop 1.2.1 is installed on a quad-core machine with hyper-threading. The MR program is written in Python, so I'm using Hadoop Streaming to run it. The input file is about 500 MB.

In the mapred-site.xml file, I added the following configurations:

mapred.max.split.size : 250MB
mapred.tasktracker.map.tasks.maximum : 1 // varied across runs: 1, 2, 4, 8, 16, 32
mapred.tasktracker.reduce.tasks.maximum : 2 

Since I set the split size to half the file size, the number of map tasks should be 2.
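The arithmetic behind that expectation can be sketched as follows. This is only my back-of-the-envelope reasoning, not how Hadoop actually computes splits: the helper `expected_splits` is a hypothetical name, and the sketch assumes the configured split size is honoured as-is, ignoring the HDFS block size and whether the input format is splittable at all.

```python
import math

MB = 1024 * 1024


def expected_splits(file_size_bytes, split_size_bytes):
    """Naive estimate: one split per full or partial chunk of split_size_bytes.

    Assumption: the split size is applied exactly as configured, which
    ignores block size and splittability of the input.
    """
    return int(math.ceil(file_size_bytes / float(split_size_bytes)))


if __name__ == "__main__":
    # Figures from the question: a file of about 500 MB, 250 MB splits.
    print(expected_splits(500 * MB, 250 * MB))
```

If the job reports a different number of map tasks than this estimate, one of the assumptions above does not hold.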

My understanding is that there are up to 2 map tasks reading and parsing the data assigned to them.

When there is one mapper: Maptask1 and Maptask2 parse the data concurrently, but there is only one mapper to do the mapping, so it has to run in two waves (i.e. do the work twice).

Now, my assumption was that when the number of mappers is increased, Maptask1 and Maptask2 parse the data concurrently, mapper1 processes the output of Maptask1, and mapper2 processes the output of Maptask2, so both mappers run concurrently.
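To make the "waves" idea concrete, here is a minimal sketch of what I expected. The function `map_waves` is a hypothetical helper, and it assumes every map task takes one slot and that a new wave only starts once a slot is free.

```python
import math


def map_waves(num_map_tasks, map_slots):
    """Number of sequential waves needed to run num_map_tasks on map_slots slots.

    Assumption: each task occupies exactly one slot for its whole duration.
    """
    return int(math.ceil(num_map_tasks / float(map_slots)))


if __name__ == "__main__":
    # Slot counts tried in the question, with the 2 map tasks I expected.
    for slots in (1, 2, 4, 8, 16, 32):
        print(slots, map_waves(2, slots))
```

Under these assumptions, 2 map tasks need 2 waves with 1 slot and a single wave with 2 or more slots, so I expected a difference between the first two settings and none beyond that.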

However, I see no difference in the execution time. I tried with 1, 2, 4, 8, 16, and 32 mappers, and the execution times are all within 1 second of each other.

Could someone please explain why?


2 Answers

0
votes

I think the question is whether you have enough hardware threads. You need a thread each for the JobTracker, the NameNode, the TaskTracker and the DataNode. Given your current configuration, I don't think you can expect a speedup if your hardware can't back it. For example, if you run 1000 threads on a machine with 4 cores, your maximum speedup will still be 4. One way to check whether everything is properly configured is to add a log statement to the map task and check whether 1, 2, 4, ... tasks are started simultaneously.
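A minimal sketch of such a log statement for a Python streaming mapper is below. It is not your mapper: the pass-through map logic is a placeholder, `log_start` is a hypothetical helper, and I'm assuming the task ID is exposed to the script as the `mapred_task_id` environment variable (the code falls back to a dummy value if it is not). Anything written to stderr should end up in the task's log.

```python
#!/usr/bin/env python
import os
import socket
import sys
import time


def log_start(stream=sys.stderr):
    """Write one line saying which map task started, where, and when."""
    # Assumption: streaming exports the task ID under this name.
    task_id = os.environ.get("mapred_task_id", "unknown-task")
    stream.write(
        "map task started: task=%s host=%s pid=%d time=%.3f\n"
        % (task_id, socket.gethostname(), os.getpid(), time.time())
    )
    stream.flush()


def main():
    log_start()
    for line in sys.stdin:
        # Placeholder for the real map logic: emit each line unchanged.
        sys.stdout.write(line)


if __name__ == "__main__":
    main()
```

Comparing the logged timestamps across tasks shows how many map tasks exist and whether their start times overlap.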

0
votes

I'm guessing your single input file has been compressed with gzip and you are running into the fact that gzip is not splittable. A single gzipped file is limited to one mapper, no more.
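If you are not sure whether the file is gzipped, one quick check is to look at its first two bytes, which are 0x1f 0x8b for gzip data. A small sketch, assuming you have a local copy of the input file; `looks_gzipped` is just a name I made up, and the path comes from the command line:

```python
import sys

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(path):
    """Return True if the file at path starts with the gzip magic bytes."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


if __name__ == "__main__" and len(sys.argv) > 1:
    print(looks_gzipped(sys.argv[1]))
```

This only inspects the header, so it tells you the file is gzip data regardless of its extension.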

See also: Hadoop gzip compressed files