In Brief:
I have files made up of chunks that are the same size as the HDFS block size; each chunk is independent but must be provided whole to a mapper. Since my Mapper's setup method consumes a significant amount of time, how can I configure my Mappers to process multiple blocks/chunks before being discarded, whilst also exploiting data locality?
In Long:
I'm trying to use Hadoop to process large numbers of large files in big chunks, something Hadoop is excellent at. Each chunk of each input file can be processed entirely separately, but each chunk must be taken whole. To make this work well under Hadoop I've made each chunk exactly the size of an HDFS block, and I've written a 'BlockInputFormat' and 'BlockRecordReader' to hand entire blocks to the Mapper one at a time. This appears to work well.
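Roughly, the shape of what I've got is the following (a simplified sketch rather than my actual code, with error handling and so on left out): I rely on FileInputFormat's default splitting, which gives one block-sized FileSplit per chunk here, and the RecordReader emits the whole block to the Mapper as a single key/value pair.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Since my chunks are exactly one block long, FileInputFormat's default
    // getSplits() produces one split per block; only the RecordReader changes.
    public class BlockInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

        @Override
        public RecordReader<LongWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new BlockRecordReader();
        }

        public static class BlockRecordReader
                extends RecordReader<LongWritable, BytesWritable> {

            private FSDataInputStream in;
            private long start;
            private long length;
            private boolean consumed = false;
            private final LongWritable key = new LongWritable();
            private final BytesWritable value = new BytesWritable();

            @Override
            public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                    throws IOException {
                FileSplit split = (FileSplit) genericSplit;
                start = split.getStart();
                length = split.getLength();
                Path file = split.getPath();
                FileSystem fs = file.getFileSystem(context.getConfiguration());
                in = fs.open(file);
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (consumed) {
                    return false;                  // only one record per split
                }
                // A block-sized chunk fits comfortably in a byte[] (< 2 GB).
                byte[] buffer = new byte[(int) length];
                in.readFully(start, buffer);       // read the whole block at once
                key.set(start);
                value.set(buffer, 0, buffer.length);
                consumed = true;
                return true;
            }

            @Override
            public LongWritable getCurrentKey() { return key; }

            @Override
            public BytesWritable getCurrentValue() { return value; }

            @Override
            public float getProgress() { return consumed ? 1.0f : 0.0f; }

            @Override
            public void close() throws IOException {
                if (in != null) {
                    in.close();
                }
            }
        }
    }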
The issue I face is that my Mapper tasks necessarily have a significant amount of work to do in the setup method, and then the 'map' function is only called once before the whole object is discarded. I have tried increasing the minimum split size via mapreduce.input.fileinputformat.split.minsize, which reduces the number of setup calls to one per input file (since each input file ends up in its own InputSplit anyway). My concern is that in doing this I lose the data locality that MapReduce provides, since I think it means an InputSplit now spans blocks that aren't necessarily on the Mapper's machine.
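For reference, this is roughly how I'm raising the split size in the driver at the moment (the class name and the 512 MB figure are just illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BlockJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative value: with 128 MB blocks, a 512 MB minimum split size
            // packs four block-sized chunks into each split, so setup() runs once
            // per four chunks instead of once per chunk.
            conf.setLong("mapreduce.input.fileinputformat.split.minsize",
                         512L * 1024 * 1024);

            Job job = Job.getInstance(conf, "block-processing");
            job.setInputFormatClass(BlockInputFormat.class);
            // ... mapper class, input/output paths, etc.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }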
In summary, my question is: How can I configure a Mapper to read multiple blocks (perhaps even from different input files) whilst preserving data locality? Would I be better off putting each chunk into its own file?
Thanks for any help you can provide, Phil