2
votes

I'm currently writing a distributed application which parses PDF files with the help of Hadoop MapReduce. The input to the MapReduce job is thousands of PDF files (mostly ranging from 100KB to ~2MB), and the output is a set of parsed text files.

For testing purposes, I initially used the WholeFileInputFormat provided in Tom White's Hadoop: The Definitive Guide, which feeds a single file to a single map. This worked fine with a small number of input files; however, it does not work properly with thousands of files, for obvious reasons: a single map task that takes around a second to complete is inefficient.

So, what I want to do is submit several PDF files to one map (for example, combining several files into a single split of around the HDFS block size, ~64MB). I found out that CombineFileInputFormat is useful for my case. However, I cannot come up with an idea of how to extend that abstract class so that I can process each file and its filename as a single key-value record.
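Roughly what I have in mind is sketched below; class names like MultiPdfInputFormat and PdfRecordReader are just placeholders, and I'm not sure this is the right way to wire up CombineFileRecordReader:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class MultiPdfInputFormat extends CombineFileInputFormat<Text, BytesWritable> {

    public MultiPdfInputFormat() {
        // Cap each combined split at roughly one HDFS block (~64MB).
        setMaxSplitSize(64 * 1024 * 1024);
    }

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // A PDF must stay in one piece; never split an individual file.
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader delegates to one PdfRecordReader per file in the split.
        return new CombineFileRecordReader<Text, BytesWritable>(
                (CombineFileSplit) split, context, PdfRecordReader.class);
    }

    /** Reads one whole PDF (identified by its index in the combined split) as a single record. */
    public static class PdfRecordReader extends RecordReader<Text, BytesWritable> {
        private final Path path;
        private final long length;
        private final TaskAttemptContext context;
        private final Text key = new Text();
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        public PdfRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
            this.path = split.getPath(index);
            this.length = split.getLength(index);
            this.context = context;
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) { }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            // Key: the file name; value: the raw bytes of the whole PDF.
            key.set(path.getName());
            byte[] contents = new byte[(int) length];
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            FSDataInputStream in = fs.open(path);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }

        @Override public Text getCurrentKey() { return key; }
        @Override public BytesWritable getCurrentValue() { return value; }
        @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
        @Override public void close() { }
    }
}
```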

Any help is appreciated. Thanks!


2 Answers

1
votes

I think a SequenceFile will suit your needs here: http://wiki.apache.org/hadoop/SequenceFile

Essentially, you put all your PDFs into a sequence file and the mappers will receive as many PDFs as fit into one HDFS block of the sequence file. When you create the sequence file, you'll set the key to be the PDF filename, and the value will be the binary representation of the PDF.
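For example, here is a quick sketch of packing the PDFs into a sequence file ahead of the job; the paths and the class name PdfsToSequenceFile are just placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PdfsToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inputDir = new Path("/input/pdfs");      // directory holding the PDFs (placeholder)
        Path seqFile = new Path("/input/pdfs.seq");   // resulting sequence file (placeholder)

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, seqFile, Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(inputDir)) {
                byte[] contents = new byte[(int) status.getLen()];
                FSDataInputStream in = fs.open(status.getPath());
                try {
                    IOUtils.readFully(in, contents, 0, contents.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                // Key: the PDF filename; value: the raw PDF bytes.
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}
```

The MapReduce job then just uses SequenceFileInputFormat with Text keys and BytesWritable values, and each mapper receives the (filename, bytes) pairs that fall into its block of the sequence file.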

0
votes

You can create a text file with the HDFS paths to your files and use it as the input. This gives you mapper reuse across many files, but at the cost of data locality. If your data is relatively small, a high replication factor (close to the number of data nodes) will solve the problem.
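A rough sketch of that approach, assuming the input is a plain text file listing one HDFS path per line (PdfPathMapper and parsePdf are placeholders for your own code):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PdfPathMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Each input line is one HDFS path to a PDF; open it directly from the mapper.
        Path pdfPath = new Path(line.toString().trim());
        FileSystem fs = pdfPath.getFileSystem(context.getConfiguration());
        FSDataInputStream in = fs.open(pdfPath);
        try {
            String parsedText = parsePdf(in);  // plug in the actual PDF parsing here
            context.write(new Text(pdfPath.getName()), new Text(parsedText));
        } finally {
            in.close();
        }
    }

    // Placeholder for whatever PDF library the job actually uses.
    private String parsePdf(FSDataInputStream in) throws IOException {
        return "";
    }
}
```

With the default TextInputFormat each mapper gets a whole block's worth of paths; if you want to control how many files each mapper handles, NLineInputFormat can be used for the path list instead.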