I've been reading a lot of documentation and asking questions about Hadoop recently, but there's still one thing I don't understand.
In the following two scenarios, what exactly happens?
Common Configuration
- File Size = 1GB
- Hadoop 1.2.1 installed on a quad-core machine with hyper-threading
- Hadoop runs in pseudo-distributed mode
Scenario 1
- Split Size = 1GB => there's only one map task
- mapred.tasktracker.map.tasks.maximum = 4
My understanding is that even though up to 4 mappers can run simultaneously on this node, I only have one MapTask, so it uses only 1 mapper (1 process).
Scenario 2
This is what confuses me the most.
- Split Size = 250MB => there are four map tasks
- mapred.tasktracker.map.tasks.maximum = 1
In this case, what actually happened was that the job ran a lot faster than in Scenario 1 and used more processes. I'm confused because I understand that MapTasks can run simultaneously, but isn't that also bounded by the number of mappers? So I thought execution would look like this, with a similar execution time:
mapper processes map task 1 ----> done
mapper processes map task 2 ----> done
mapper processes map task 3 ----> done
mapper processes map task 4 ----> done
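To make my expectation concrete, here is a minimal sketch of the mental model above. It is a toy simulation, not Hadoop's actual scheduler: the function name `simulate_slots` is made up, and the durations (40 time units for the 1GB split, 10 for each 250MB split) are hypothetical numbers that assume map time is proportional to split size.

```python
import heapq


def simulate_slots(task_durations, slots):
    """Toy model: run tasks in order on a fixed number of map slots.

    Each task is handed to whichever slot becomes free first.
    Returns (total_time, schedule), where schedule is a list of
    (task_id, slot_id, start, end) tuples.
    """
    if slots < 1:
        raise ValueError("need at least one slot")

    # Min-heap of (time at which the slot is free, slot id).
    free_at = [(0.0, slot_id) for slot_id in range(slots)]
    heapq.heapify(free_at)

    schedule = []
    for task_id, duration in enumerate(task_durations, start=1):
        start, slot_id = heapq.heappop(free_at)
        end = start + duration
        schedule.append((task_id, slot_id, start, end))
        heapq.heappush(free_at, (end, slot_id))

    total_time = max((end for _, _, _, end in schedule), default=0.0)
    return total_time, schedule


# Hypothetical durations, assuming map time scales with split size.
scenario_1, _ = simulate_slots([40.0], slots=4)      # one 1GB split, 4 slots
scenario_2, _ = simulate_slots([10.0] * 4, slots=1)  # four 250MB splits, 1 slot
print(scenario_1, scenario_2)  # prints "40.0 40.0"
```

Under this model both scenarios take the same total time, which is why the faster result in Scenario 2 surprised me.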
Question
When I have more MapTasks than mappers, what exactly happens?