I'm testing a MapReduce program to see how the execution time changes as I change the number of mappers.
Hadoop 1.2.1 is installed on a quad-core machine with hyper-threading. The MR program is written in Python, so I'm using Hadoop Streaming to run it. The input file is about 500 MB.
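For context, the mapper follows the usual streaming pattern, reading lines from stdin and emitting tab-separated key/value pairs (this is a hypothetical word-count-style sketch, not my actual job logic):

```python
#!/usr/bin/env python
# Hypothetical streaming mapper sketch: reads lines from stdin,
# emits "key<TAB>value" pairs, one per line, as Hadoop Streaming expects.
import sys

def map_line(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word, 1) for word in line.strip().split()]

if __name__ == "__main__":
    for line in sys.stdin:
        for key, value in map_line(line):
            print("%s\t%d" % (key, value))
```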
In the mapred-site.xml file, I added the following configurations:
mapred.max.split.size: 250 MB
mapred.tasktracker.map.tasks.maximum: 1 (also tried 2, 4, 8, 16, 32)
mapred.tasktracker.reduce.tasks.maximum: 2
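In mapred-site.xml this looks roughly like the following (note that mapred.max.split.size is given in bytes, so 250 MB would be 262144000; property names here assume the Hadoop 1.x naming):

```xml
<configuration>
  <property>
    <name>mapred.max.split.size</name>
    <value>262144000</value> <!-- 250 MB, in bytes -->
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>1</value> <!-- varied across runs: 1, 2, 4, 8, 16, 32 -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```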
Since I set the split size to half the file size, the number of map tasks should be 2.
My understanding is that there are up to 2 map tasks, each reading and parsing the data assigned to it.
When there is one mapper: MapTask1 and MapTask2 parse the data concurrently, but there is only one mapper slot to run them, so the map phase happens in two waves (the mapper works twice).
My assumption was that as the number of mappers increases: MapTask1 and MapTask2 parse the data concurrently, mapper1 processes the output of MapTask1 and mapper2 processes the output of MapTask2, so both mappers run concurrently.
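My expectation boils down to simple arithmetic: with S input splits and P concurrent map slots, the map phase should run in ceil(S / P) waves (map_waves below is a hypothetical helper to express this, not a Hadoop API):

```python
import math

def map_waves(num_splits, map_slots):
    """Hypothetical model: number of sequential 'waves' of map tasks
    when num_splits input splits share map_slots concurrent map slots."""
    return math.ceil(num_splits / map_slots)

# 2 splits, 1 slot  -> 2 waves (the single mapper works twice)
# 2 splits, 2 slots -> 1 wave  (both map tasks run concurrently)
```

So going from 1 slot to 2 slots should roughly halve the map-phase time, and more than 2 slots should make no further difference for only 2 splits.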
However, I see no difference in execution time. I tried 1, 2, 4, 8, 16, and 32 mappers, and all the timings are within one second of each other.
Could someone please explain why?