128 MB file, with 64 MB block size --> Defaults --> 2 Map tasks
128 MB file, with 64 MB block size --> Min split size 128 MB --> 1 Map task
You could do that, but you would lose data locality. The default split algorithm sticks
to block boundaries so that each task processes exactly one block, which lets the
scheduler place the task on a node where that block resides.
When you override the minimum split size so that one split carries two blocks' worth of
offset + length, the two blocks may reside on different nodes, but the task runs on only
one node. Part of the input may then have to be read over the network (non-data-local
processing), which could end up being slower.
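
To make the two cases above concrete, here is a rough sketch of the split arithmetic. It is a simplified model of what FileInputFormat does (split size = max(minSize, min(maxSize, blockSize)), plus a 10% slop on the last split), not the actual Hadoop code. The helper names, the node names, and the one-replica-per-block placement are assumptions made up for illustration.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # the final split may be up to 10% larger than split_size


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    # Split size is the block size, clamped between min_size and max_size.
    return max(min_size, min(max_size, block_size))


def get_splits(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Return a list of (offset, length) pairs, one per map task."""
    split_size = compute_split_size(block_size, min_size, max_size)
    splits = []
    offset = 0
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    if remaining > 0:
        # Whatever is left (at most 1.1 * split_size) becomes the last split.
        splits.append((offset, remaining))
    return splits


def blocks_for_split(offset, length, block_size):
    """Indices of the HDFS blocks that a split overlaps."""
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


# Hypothetical placement: block index -> node holding it (one replica only).
block_nodes = {0: "node-a", 1: "node-b"}

if __name__ == "__main__":
    for min_size in (1, 128 * MB):
        splits = get_splits(128 * MB, 64 * MB, min_size=min_size)
        print(f"min split size = {min_size} bytes -> {len(splits)} map task(s)")
        for offset, length in splits:
            blocks = blocks_for_split(offset, length, 64 * MB)
            nodes = sorted({block_nodes[b] for b in blocks})
            print(f"  split at offset {offset}, length {length}: "
                  f"blocks {blocks} on {nodes}")
```

With the default minimum, each split maps to a single block on a single node, so the task can be scheduled there. With the 128 MB minimum, the one split covers blocks on both hypothetical nodes, so wherever the task runs, one of the blocks is remote.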