The input to my Hadoop program is a set of small files (10 files, each of size 60MB) and I run 100 mappers. I assume that the input data for each mapper comes from ONLY one file. That is, there is no mapper whose input data spans two (or more) files. Is this a correct assumption?
2 Answers
The input to my Hadoop program is a set of small files (10 files, each of size 60MB) and I run 100 mappers.
The total number of mappers cannot be explicitly controlled. By default it equals the number of input splits, which is one per HDFS block, so I'm not sure what "I run 100 mappers" means here.
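As an illustration, here is a minimal sketch (Java, new MapReduce API) that asks the InputFormat how many splits it would create. The input path is a hypothetical placeholder, and setting mapreduce.job.maps is shown only to demonstrate that it is merely a hint to the framework:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // This is only a hint; the real number of map tasks comes from the
        // InputFormat's split calculation below, not from this value.
        conf.setInt("mapreduce.job.maps", 100);

        Job job = Job.getInstance(conf, "split-count");
        FileInputFormat.addInputPath(job, new Path("/user/hypothetical/input"));

        // One split per HDFS block for splittable files, and never fewer
        // than one split per file.
        int splits = new TextInputFormat().getSplits(job).size();
        System.out.println("Map tasks that would run: " + splits);
    }
}
```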
I assume that the input data for each mapper comes from ONLY one file.
A mapper processes one input split, and a file is stored as one or more blocks depending on its size. A split produced by the default FileInputFormat never spans file boundaries, and each of your 60 MB files fits inside a single block (the default block size is 128 MB, or 64 MB on older releases). So your assumption holds: by default, each mapper reads from only one file, and you would get ten mappers rather than one hundred.
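To verify this on your own cluster, a quick sketch (the input directory is a hypothetical placeholder) that prints how many blocks each input file occupies:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReport {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus st : fs.listStatus(new Path("/user/hypothetical/input"))) {
            BlockLocation[] blocks = fs.getFileBlockLocations(st, 0, st.getLen());
            // A 60 MB file under the default 128 MB block size reports one block.
            System.out.printf("%s: %d bytes in %d block(s)%n",
                    st.getPath().getName(), st.getLen(), blocks.length);
        }
    }
}
```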
That is, there is no mapper whose input data spans two (or more) files.
With CombineFileInputFormat, on the other hand, a single mapper can process more than one file, because that format packs multiple files (or blocks) into one split.
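A minimal driver sketch, assuming the new MapReduce API and a hypothetical input path; CombineTextInputFormat is the concrete text-oriented subclass shipped with Hadoop:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class CombineExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-example");
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Pack up to ~256 MB of input, possibly drawn from several small files,
        // into a single split, so one mapper may read from more than one file.
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path("/user/hypothetical/input"));
        // ... set mapper, reducer, and output classes as usual, then
        // job.waitForCompletion(true)
    }
}
```

With the ten 60 MB files from the question, that 256 MB cap would yield roughly three mappers instead of ten (the exact grouping also depends on block locality).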