1
votes

Let's say the block size is the default 64 MB when storing in HDFS, and I change the InputSplit size to 128 MB.

One of the data nodes has only one block of the file stored locally. The JobTracker assigns it a mapper. How does it run map() on a 128 MB split?


2 Answers

1
votes

128 MB file, 64 MB block size, default split size --> 2 map tasks
128 MB file, 64 MB block size, min split size 128 MB --> 1 map task

You could do that, but you would lose data locality. The default split algorithm sticks to block boundaries so that each task processes exactly one block, and the scheduler can more effectively run the task on the node where that block resides.

When you override the minimum split size so that a split carries two blocks' worth of offset + length, those two blocks may reside on different nodes, but the task runs on only one node, leading to non-data-local processing, which can end up being slower.
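If you do want each map task to process two blocks anyway, a minimal sketch of setting the minimum split size with the newer org.apache.hadoop.mapreduce API could look like the following; the class name, job name, and the 128 MB figure are just placeholders for this example (with the old mapred API the equivalent property is mapred.min.split.size):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class BigSplitJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "big-split-example");
            job.setJarByClass(BigSplitJob.class);

            // Ask for splits of at least 128 MB, i.e. two 64 MB blocks per split.
            FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
            FileInputFormat.addInputPath(job, new Path(args[0]));

            // ... set mapper/reducer classes and the output path as usual ...
        }
    }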

0
votes

In this situation you are effectively guaranteeing that part of the data a mapper needs is not local to the node on which that mapper runs.

The Hadoop framework will make sure the mapper gets the data, but it will mean an increase in network traffic.
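To make that concrete, here is a rough sketch (not the actual framework code; the path and offsets are made up for illustration) of what a record reader effectively does: it opens the file through the HDFS client and reads straight through the split's byte range, and the client transparently fetches the block that is not local from a remote DataNode:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SplitReadSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            long splitStart = 0L;                     // hypothetical split offset
            long splitLength = 128L * 1024 * 1024;    // one 128 MB split = two 64 MB blocks

            try (FSDataInputStream in = fs.open(new Path("/example/input.txt"))) {
                in.seek(splitStart);
                byte[] buf = new byte[64 * 1024];
                long remaining = splitLength;
                while (remaining > 0) {
                    int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                    if (n < 0) break;                 // reached end of file
                    remaining -= n;
                    // The first 64 MB is served by the local DataNode; once the
                    // stream crosses the block boundary, the HDFS client opens a
                    // connection to a remote DataNode holding the second block.
                }
            }
        }
    }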