0 votes

The input to my Hadoop program is a set of small files (10 files, each of size 60MB) and I run 100 mappers. I assume that the input data for each mapper comes from ONLY one file. That is, there is no mapper whose input data spans two (or more) files. Is this a correct assumption?

2 Answers

2 votes

Yes, your assumption is correct: with the default FileInputFormat, an input split never spans file boundaries, so each mapper's input comes from exactly one file. You could also use CombineFileInputFormat to feed content from multiple files to a single mapper invocation.
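For example, a minimal driver sketch (new MapReduce API, assuming Hadoop 2.x and its CombineTextInputFormat subclass of CombineFileInputFormat; the class name, paths, and 128 MB split cap are illustrative):

    // Pack many small files into fewer splits so that one mapper
    // invocation can read several files. Assumes Hadoop 2.x APIs.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSmallFiles {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "combine-small-files");
            job.setJarByClass(CombineSmallFiles.class);

            // Text lines, but a split may now span several input files.
            job.setInputFormatClass(CombineTextInputFormat.class);
            // Cap each combined split at 128 MB (tune to your block size).
            CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With your 10 x 60 MB files, this would pack the input into roughly five 128 MB splits instead of ten, so around five mappers would each read two files.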

By the way, inside the mapper you can find out which file is being read: cast the InputSplit to a FileSplit and call getPath() (with the old API, the file name is also available through the map.input.file job configuration property).
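A minimal sketch of that (new API; the class name and the log line are illustrative):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    public class FileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // With FileInputFormat-based formats the split is a FileSplit;
            // a CombineFileInputFormat split carries several paths instead.
            FileSplit split = (FileSplit) context.getInputSplit();
            System.err.println("This mapper reads: " + split.getPath().getName());
        }
    }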

1 vote

The input to my Hadoop program is a set of small files (10 files, each of size 60MB) and I run 100 mappers.

The total number of mappers cannot be controlled explicitly; it equals the number of input splits, which by default is the number of HDFS blocks in the input (mapred.map.tasks is only a hint to the framework). With 10 files of 60 MB each and a 64 MB block size, each file fits in a single block, so the job would run 10 map tasks. So I am not sure what "I run 100 mappers" means.
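A back-of-the-envelope sketch of that count for the input described in the question (the 64 MB block size is an assumption; older Hadoop releases default to 64 MB, newer ones to 128 MB):

    public class SplitCount {
        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024; // assumed HDFS block size
            long fileSize  = 60L * 1024 * 1024; // each of the 10 input files
            int  files     = 10;

            // Default rule: one split per block, and splits never cross files.
            long splitsPerFile = (fileSize + blockSize - 1) / blockSize; // ceil
            System.out.println("map tasks = " + files * splitsPerFile); // 10, not 100
        }
    }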

I assume that the input data for each mapper comes from ONLY one file.

A mapper processes a single input split, and a file is divided into one or more splits depending on its size relative to the block size.

That is, there is no mapper whose input data spans two (or more) files.

That holds with the default input formats. By using CombineFileInputFormat, though, a single mapper is able to process more than one file.