1
votes

Let's say the block size is the default 64 MB when storing in HDFS, and I change the InputSplit size to 128 MB.

One of the data nodes has only one block of the file stored locally. The JobTracker assigns it a mapper. How does it run map() on a 128 MB split?


2 Answers

1
votes

128 MB file, 64 MB block size, default split size --> 2 map tasks
128 MB file, 64 MB block size, min split size 128 MB --> 1 map task

You could do that, but you would lose data locality. The default split algorithm sticks to block boundaries so that each task processes exactly one block, and the scheduler can more effectively run the task on the node where that block resides.

When you override the minimum split size so that a split carries two blocks' worth of offset + length, those two blocks may reside on different nodes, but the task runs on only one node, leading to non-data-local processing, which can end up being slower.
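If you do want each map task to process two blocks anyway, a minimal sketch of setting the minimum split size with the newer org.apache.hadoop.mapreduce API could look like the following; the class name, job name, and the 128 MB figure are just placeholders for this example (with the old mapred API the equivalent property is mapred.min.split.size):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class BigSplitJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "big-split-example");
            job.setJarByClass(BigSplitJob.class);

            // Ask for splits of at least 128 MB, i.e. two 64 MB blocks per split.
            FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
            FileInputFormat.addInputPath(job, new Path(args[0]));

            // ... set mapper/reducer classes and the output path as usual ...
        }
    }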

0
votes

In this situation you are effectively guaranteeing that part of the data a mapper needs is not local to the node on which that mapper runs.

The Hadoop framework will make sure the mapper gets the data, but it will mean an increase in network traffic.
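To make that concrete, here is a rough sketch (not the actual framework code; the path and offsets are made up for illustration) of what a record reader effectively does: it opens the file through the HDFS client and reads straight through the split's byte range, and the client transparently fetches the block that is not local from a remote DataNode:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SplitReadSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            long splitStart = 0L;                     // hypothetical split offset
            long splitLength = 128L * 1024 * 1024;    // one 128 MB split = two 64 MB blocks

            try (FSDataInputStream in = fs.open(new Path("/example/input.txt"))) {
                in.seek(splitStart);
                byte[] buf = new byte[64 * 1024];
                long remaining = splitLength;
                while (remaining > 0) {
                    int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                    if (n < 0) break;                 // reached end of file
                    remaining -= n;
                    // The first 64 MB is served by the local DataNode; once the
                    // stream crosses the block boundary, the HDFS client opens a
                    // connection to a remote DataNode holding the second block.
                }
            }
        }
    }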