None of the books and blogs I have read so far give much information about reduce task assignment. It looks like reduce tasks are assigned to available slots randomly.
This does not make sense to me, because shuffling data across the network without considering map-output locality goes against Hadoop's design principles.
There is a good chance (though no guarantee) that blocks from the same file are placed within the same rack or nearby racks, so the map tasks for those splits/blocks will usually run in those racks as well.
If that is the common case, why not try to assign reduce tasks to slots in the same rack(s) as the map tasks?
Wouldn't this improve performance in a 1000+ node cluster, particularly when the input is a SequenceFile or MapFile?
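To make concrete what I am imagining, here is a minimal sketch of such a heuristic. This is purely hypothetical on my part, not anything from the Hadoop codebase; the class, method names, and data structures are all made up for illustration:

```java
import java.util.*;

/**
 * Hypothetical sketch (NOT Hadoop's actual scheduler): given the racks where a
 * job's map tasks ran, prefer placing a reduce task on the rack that produced
 * the most map output, and fall back to any free slot otherwise.
 */
public class RackAwareReducePlacement {

    /** Pick a host for a reduce task, preferring a slot on the "busiest" map rack. */
    static String chooseReduceHost(Map<String, Long> mapOutputBytesPerRack,
                                   Map<String, String> freeSlotHostToRack) {
        // Rack holding the largest share of intermediate map output.
        String preferredRack = mapOutputBytesPerRack.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);

        // First choice: a free slot on the preferred rack (minimizes cross-rack shuffle).
        for (Map.Entry<String, String> slot : freeSlotHostToRack.entrySet()) {
            if (slot.getValue().equals(preferredRack)) {
                return slot.getKey();
            }
        }
        // Fallback: any free slot (effectively what random placement gives you).
        return freeSlotHostToRack.keySet().stream().findFirst().orElse(null);
    }

    public static void main(String[] args) {
        Map<String, Long> mapOutput = Map.of("/rack1", 900_000_000L, "/rack2", 100_000_000L);
        Map<String, String> freeSlots = Map.of("node-a", "/rack2", "node-b", "/rack1");
        // Prints "node-b", since /rack1 holds most of the map output.
        System.out.println("Chosen host: " + chooseReduceHost(mapOutput, freeSlots));
    }
}
```

Something along those lines seems cheap to compute at schedule time, which is why the apparently random placement surprises me.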
Can anyone please confirm whether reducers really are placed randomly (the Definitive Guide book says so)? If yes, why was that decision made? If I am wrong, what is the logic used to assign reducers? Links to docs explaining that logic would be appreciated too.
Thanks a lot in advance.
Arun