0
votes

I have used Pig and Hive before, but I am new to Hadoop MapReduce. I need to write an application that takes multiple small files as input (say 10). They have different file structures, so I want to process them in parallel on separate nodes so that they finish quickly. I know Hadoop's strong point is processing large data sets, but these input files, though small, require a lot of processing, so I was hoping to leverage Hadoop's parallel computing prowess. Is this possible?

2
How small are these files, and what kind of processing are you going to perform? – Tariq
The files are pretty small, from 1 to 20 KB. And we have to perform a lot of different checks to ensure that each file is in the correct format and is not corrupt. – aa8y
Can this be achieved by partitioning (maybe based on filename)? Please answer this question of mine, in which I am encountering a problem while partitioning the data: stackoverflow.com/questions/14193646/… – aa8y

2 Answers

0
votes

It is possible, but you're probably not going to get much value. You have these forces working against you:

Confused input

You'll need to write a mapper that can handle all of the different input formats (either by detecting the input format, or by using the input file name to decide which format to expect).
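
For illustration, here is a minimal sketch of such a mapper (newer mapreduce API), assuming TextInputFormat-style (LongWritable, Text) input and using file extensions as stand-ins for your real format detection; the actual parsing/check logic is only hinted at in the comments:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Sketch only: route each record to format-specific handling based on
    // the name of the file the current split came from.
    public class MultiFormatMapper extends Mapper<LongWritable, Text, Text, Text> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        // Name of the input file this split belongs to
        String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();

        if (fileName.endsWith(".csv")) {
          // run the format-A parsing/validation checks on `value` here
          context.write(new Text(fileName), new Text("formatA-checked"));
        } else if (fileName.endsWith(".xml")) {
          // run the format-B parsing/validation checks on `value` here
          context.write(new Text(fileName), new Text("formatB-checked"));
        } else {
          // unknown format: emit a marker so it shows up in the output
          context.write(new Text(fileName), new Text("unknown-format"));
        }
      }
    }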

Multiple outputs

You'll either need to use Hadoop's slightly tricky multiple-output-file handling functionality, or write your output as a side effect of the reducer (or of the mapper, if you can be sure that each file will go to a different node).
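
One way to do the former is Hadoop's MultipleOutputs class. The sketch below assumes the mapper emits the source file name as the key (as in the sketch above), so each input file's results land in an output file named after it:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Sketch only: write each source file's results to its own output file.
    public class PerFileReducer extends Reducer<Text, Text, Text, Text> {

      private MultipleOutputs<Text, Text> out;

      @Override
      protected void setup(Context context) {
        out = new MultipleOutputs<Text, Text>(context);
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        for (Text v : values) {
          // third argument = base name of the output file for this key;
          // Hadoop appends the usual -r-00000 style suffix
          out.write(key, v, key.toString());
        }
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        out.close();  // flush and close the extra output files
      }
    }

In the driver you would typically also use LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class) so that empty default part files are not created alongside the per-file outputs.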

High Cost of initialization

Every Hadoop MapReduce job comes with a hefty start-up cost: about 30 seconds on a small cluster, and much more on a larger one. This alone will probably cost you more time than you could ever hope to gain through parallelism.

0
votes

In brief: give NLineInputFormat a try.

There is no problem with copying all your input files to all nodes (you can put them in the distributed cache if you like). What you really want to distribute is the check processing.

With Hadoop you can create a (single!) input control file in the format (filename,check2run) or (filename,format,check2run) and use NLineInputFormat to feed a specified number of checks to each of your nodes (mapreduce.input.lineinputformat.linespermap controls the number of lines fed to each mapper).

Note: the Hadoop input format determines how splits are calculated; NLineInputFormat (unlike TextInputFormat) does not care about block boundaries.

Depending on the nature of your checks, you may be able to compute a linespermap value that covers all files/checks in one wave of mappers (or you may be unable to use this approach at all :) ).
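
Here is a minimal driver sketch of this approach, assuming the Hadoop 2.x mapreduce API; the paths, file names, and the check logic itself are placeholders:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CheckDriver {

      // Each map() call receives one control line, e.g. "file01.dat,formatCheck".
      public static class CheckMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] parts = value.toString().split(",");
          String fileName = parts[0];
          String check = parts[1];
          // Open fileName from the distributed cache and run `check` on it here.
          context.write(new Text(fileName), new Text(check + ":OK"));
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "file checks");
        job.setJarByClass(CheckDriver.class);

        // One control line per mapper => one (file, check) pair per map task.
        job.setInputFormatClass(NLineInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/input/checks.txt"));
        NLineInputFormat.setNumLinesPerSplit(job, 1); // mapreduce.input.lineinputformat.linespermap

        // Ship the small data files to every node so any mapper can read them locally.
        job.addCacheFile(new Path("/input/files/file01.dat").toUri());

        job.setMapperClass(CheckMapper.class);
        job.setNumReduceTasks(0); // map-only: each mapper writes its own result
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, new Path("/output/checks"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Raising setNumLinesPerSplit above 1 packs several checks into each mapper, which is how you would tune one wave of mappers to cover all files/checks.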