Consider this scenario:
I have 4 files of 6 MB each, and the HDFS block size is 64 MB. My understanding is that one block will hold all four files, with some space left over, and that any new files added later would be accommodated in the same block.
Now, when the input splits are calculated for a MapReduce job by the InputFormat (the split size is usually the HDFS block size, so that each split can be loaded into memory for processing, thereby reducing seek time), how many input splits are made here? Is it one, because all four files are contained within a single block? Or is it one input split per file? How is this determined?
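From what I can tell from the Hadoop source, splits are computed per file by FileInputFormat, and the per-file split size comes out of logic like this (paraphrased; the defaults shown are my reading of mapreduce.input.fileinputformat.split.minsize/maxsize):

```java
public class SplitSizeDemo {

    // Paraphrased from FileInputFormat.computeSplitSize(): the split size
    // is the block size, clamped between the configured min and max.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64 MB HDFS block
        long minSize = 1L;                  // default split.minsize, as I read it
        long maxSize = Long.MAX_VALUE;      // default split.maxsize, as I read it

        // Splits never span files, so a 6 MB file would yield a single
        // 6 MB split even though the computed split size is 64 MB.
        System.out.println(computeSplitSize(blockSize, minSize, maxSize)); // 67108864
    }
}
```

If I'm reading that right, each 6 MB file becomes its own split, giving 4 mappers, which brings me to my last question.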
What if I want all the files to be processed as a single input split?
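Here is what I'm considering: a minimal driver sketch assuming CombineTextInputFormat can pack the small files into one split when the max split size (64 MB here) covers all of them. Mapper/reducer setup is omitted, and the class name is just for illustration:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleSplitJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-split-example");
        job.setJarByClass(SingleSplitJob.class);

        // CombineTextInputFormat packs multiple small files into one split,
        // up to the max split size; 64 MB comfortably covers 4 x 6 MB.
        job.setInputFormatClass(CombineTextInputFormat.class);
        CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // ... set mapper/reducer classes here ...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Is this the right way to get a single split over all four files, or is there a better approach?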