Hadoop MapReduce: MapTasks vs. Mapper

Question

I've been reading a lot of documentation and asking questions about Hadoop recently, but there's one thing that I still don't understand.

In the following two scenarios, what exactly happens?

Common Configuration

File Size = 1GB
Hadoop 1.2.1 installed on a quad-core with hyper-threading
Hadoop runs in pseudo-distributed mode

Scenario 1

Split Size = 1GB => there's only one map task
mapred.tasktracker.map.tasks.maximum = 4

My understanding is that even though up to 4 mappers can run simultaneously on this node, I only have one MapTask, so it uses only 1 mapper (1 process).

MapTask < Mapper

Scenario 2

This is the one that confuses me the most.

Split Size = 250MB => there are four map tasks
mapred.tasktracker.map.tasks.maximum = 1

What actually happened in this case was that the job ran a lot faster than in Scenario 1, and it ran with more processes. I'm confused because I understand that MapTasks can run simultaneously, but isn't that also bounded by the number of mappers? So I expected it to look like this, with a similar execution time:

mapper processes map task 1 ----> done
mapper processes map task 2 ----> done
mapper processes map task 3 ----> done
mapper processes map task 4 ----> done

Question

When I have more MapTasks than mappers, what exactly happens?

Have you looked at the JobTracker interface? That tells you all kinds of things about your job. You have the right understanding of how tasks get allocated. When you have more tasks than slots, the slots fill up, and as tasks finish they are replaced by new tasks. — Donald Miner
Does each MapTask get processed (meaning that mapping is done for that block of the file) one by one? If so, I wonder why Scenario 2 ran faster. — kabichan
Scenario 2 is running faster for other reasons. Your logic is right. — Donald Miner
@DonaldMiner Is it because the file was read and parsed simultaneously by four MapTasks? — kabichan

Donald Miner · Accepted Answer · 2013-12-05T23:28:06

So, I'll answer your question, but it doesn't explain what you are seeing in terms of performance difference.

When I have more MapTasks than mappers, what exactly happens?

If you have more map tasks than map slots, you are correct: the map slots are filled up to the configured maximum, and the remaining map tasks wait. Once a map task completes, the JobTracker assigns the next pending map task to the slot that just opened.
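To make that concrete, here is a rough sketch of the slot model in Python. It is not Hadoop code: `simulate_slots` and `makespan` are made-up names, and the task durations are hypothetical numbers that assume map time is simply proportional to split size. It only models "a task starts as soon as a slot is free".

```python
import heapq

def simulate_slots(task_durations, num_slots):
    """Greedy slot assignment: each task takes the slot that frees up first.

    Returns a list of (start, end, slot) tuples, one per task, in queue order.
    Illustrative model only; durations are hypothetical inputs.
    """
    # Heap entries are (time the slot becomes free, slot id).
    free_at = [(0.0, slot) for slot in range(num_slots)]
    heapq.heapify(free_at)
    schedule = []
    for duration in task_durations:
        start, slot = heapq.heappop(free_at)
        end = start + duration
        schedule.append((start, end, slot))
        heapq.heappush(free_at, (end, slot))
    return schedule

def makespan(schedule):
    """Time at which the last task finishes."""
    return max(end for _, end, _ in schedule)

# Scenario 1: one big task, four slots (three slots stay idle).
scenario1 = simulate_slots([4.0], num_slots=4)
# Scenario 2: four small tasks, one slot (tasks run back to back).
scenario2 = simulate_slots([1.0, 1.0, 1.0, 1.0], num_slots=1)

print(makespan(scenario1), makespan(scenario2))
```

Under these assumptions both scenarios finish at the same time, which is why slot allocation by itself does not account for the speedup you measured.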

Splitting it up into four could be faster, even if the tasks run sequentially, for a few reasons. Perhaps the buffer spilling behavior is different because each task handles a smaller amount of data. It is hard to tell what is going on from the information provided.