I'm currently writing a distributed application that parses PDF files using Hadoop MapReduce. The input to the MapReduce job is thousands of PDF files (mostly ranging from 100 KB to ~2 MB), and the output is a set of parsed text files.
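The map side is essentially the shape below. PDFBox is only a stand-in for my real parsing code, and the (Text, BytesWritable) input types assume each record is one file's name plus its raw bytes, which is exactly what I'm trying to get out of the input format (more on that further down):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// One record per PDF: key = file name, value = raw file bytes.
public class PdfParseMapper extends Mapper<Text, BytesWritable, Text, Text> {

    @Override
    protected void map(Text filename, BytesWritable contents, Context context)
            throws IOException, InterruptedException {
        // BytesWritable's backing array may be padded, so copy only the valid range.
        byte[] pdfBytes = Arrays.copyOf(contents.getBytes(), contents.getLength());

        // PDFBox here is purely an illustration of "parse bytes, emit text".
        try (PDDocument document = PDDocument.load(new ByteArrayInputStream(pdfBytes))) {
            String text = new PDFTextStripper().getText(document);
            context.write(filename, new Text(text));
        }
    }
}
```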
For testing purposes, I initially used the WholeFileInputFormat from Tom White's book Hadoop: The Definitive Guide, which feeds each whole file to a single map. This worked fine with a small number of input files, but it does not scale to thousands of files: launching one map task per file, when each task only needs about a second of actual work, wastes most of the time on task startup and scheduling overhead.
So what I want to do is feed several PDF files to one map (for example, by combining several files into a single split of roughly the HDFS block size, ~64 MB). I found out that CombineFileInputFormat looks useful for my case, but I can't work out how to extend that abstract class so that each file, keyed by its filename, comes through as a single key-value record.
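Here is my rough attempt so far. It is only a sketch, and the class names (CombinePdfInputFormat, PdfFileRecordReader) are placeholders I made up. The idea is to let CombineFileRecordReader walk the files packed into one CombineFileSplit and hand each file to a small whole-file reader that emits (path, bytes):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

// Packs many small PDFs into one split; each file still becomes exactly one record.
public class CombinePdfInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // never cut a PDF across records
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader instantiates one PdfFileRecordReader per file in the split.
        return new CombineFileRecordReader<Text, BytesWritable>(
                (CombineFileSplit) split, context, PdfFileRecordReader.class);
    }
}

// Reads the file at position `index` inside the combined split as one (path, bytes) record.
class PdfFileRecordReader extends RecordReader<Text, BytesWritable> {

    private final CombineFileSplit split;
    private final Integer index;
    private final Configuration conf;
    private final Text key = new Text();
    private final BytesWritable value = new BytesWritable();
    private boolean processed = false;

    // CombineFileRecordReader requires exactly this constructor signature.
    public PdfFileRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
        this.split = split;
        this.index = index;
        this.conf = context.getConfiguration();
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) { }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        Path path = split.getPath(index);
        byte[] contents = new byte[(int) split.getLength(index)];
        FileSystem fs = path.getFileSystem(conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(path);
            IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        key.set(path.toString());                 // filename as the key
        value.set(contents, 0, contents.length);  // whole file as the value
        processed = true;
        return true;
    }

    @Override
    public Text getCurrentKey() { return key; }

    @Override
    public BytesWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() { return processed ? 1.0f : 0.0f; }

    @Override
    public void close() { }
}
```

In the driver I would then set this as the job's input format and cap the maximum split size at roughly the block size; I believe FileInputFormat.setMaxInputSplitSize (or the corresponding max-split-size property) is what CombineFileInputFormat looks at, but I'm not sure the property name is the same across Hadoop versions. Is this roughly the right way to wire it up, or is there a cleaner way to get one (filename, file contents) record per PDF?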
Any help is appreciated. Thanks!