In Brief:
I have files made up of chunks that are the same size as the HDFS block size; each chunk is independent but must be provided whole to a mapper. Since my Mapper's setup method consumes a significant amount of time, how can I configure my Mappers to process multiple blocks/chunks before being discarded, whilst also exploiting data locality?
In Long:
I'm trying to use Hadoop to process large numbers of large files in big chunks, something Hadoop is excellent at. Each chunk of each input file can be processed entirely separately, but each chunk must be taken whole. To make this work well under Hadoop I've made each chunk exactly the size of an HDFS block, and I've written a 'BlockInputFormat' and 'BlockRecordReader' to hand entire blocks to the Mapper one at a time. This appears to work well.
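Roughly, the shape of what I've got is the following (a simplified sketch rather than my actual code, with error handling and so on left out): I rely on FileInputFormat's default splitting, which gives one block-sized FileSplit per chunk here, and the RecordReader emits the whole block to the Mapper as a single key/value pair.

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Since my chunks are exactly one block long, FileInputFormat's default
    // getSplits() produces one split per block; only the RecordReader changes.
    public class BlockInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

        @Override
        public RecordReader<LongWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new BlockRecordReader();
        }

        public static class BlockRecordReader
                extends RecordReader<LongWritable, BytesWritable> {

            private FSDataInputStream in;
            private long start;
            private long length;
            private boolean consumed = false;
            private final LongWritable key = new LongWritable();
            private final BytesWritable value = new BytesWritable();

            @Override
            public void initialize(InputSplit genericSplit, TaskAttemptContext context)
                    throws IOException {
                FileSplit split = (FileSplit) genericSplit;
                start = split.getStart();
                length = split.getLength();
                Path file = split.getPath();
                FileSystem fs = file.getFileSystem(context.getConfiguration());
                in = fs.open(file);
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                if (consumed) {
                    return false;                  // only one record per split
                }
                // A block-sized chunk fits comfortably in a byte[] (< 2 GB).
                byte[] buffer = new byte[(int) length];
                in.readFully(start, buffer);       // read the whole block at once
                key.set(start);
                value.set(buffer, 0, buffer.length);
                consumed = true;
                return true;
            }

            @Override
            public LongWritable getCurrentKey() { return key; }

            @Override
            public BytesWritable getCurrentValue() { return value; }

            @Override
            public float getProgress() { return consumed ? 1.0f : 0.0f; }

            @Override
            public void close() throws IOException {
                if (in != null) {
                    in.close();
                }
            }
        }
    }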
The issue I face is that my Mapper tasks necessarily have a significant amount of work to do in the setup method, and then the 'map' function is only called once before the whole object is discarded. I have tried increasing the minimum split size via mapreduce.input.fileinputformat.split.minsize, which reduces the number of setup calls to one per input file (since each input file ends up in its own InputSplit anyway). My concern is that in doing this I lose the data locality that MapReduce provides, since I think it means an InputSplit now spans blocks that aren't necessarily on the Mapper's machine.
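For reference, this is roughly how I'm raising the split size in the driver at the moment (the class name and the 512 MB figure are just illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class BlockJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative value: with 128 MB blocks, a 512 MB minimum split size
            // packs four block-sized chunks into each split, so setup() runs once
            // per four chunks instead of once per chunk.
            conf.setLong("mapreduce.input.fileinputformat.split.minsize",
                         512L * 1024 * 1024);

            Job job = Job.getInstance(conf, "block-processing");
            job.setInputFormatClass(BlockInputFormat.class);
            // ... mapper class, input/output paths, etc.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }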
In summary, my question is: How can I configure a Mapper to read multiple blocks (perhaps even from different input files) whilst preserving data locality? Would I be better off putting each chunk into its own file?
Thanks for any help you can provide, Phil