1 vote

Do MapReduce and any of the other Hadoop technologies (HBase, Hive, Pig, etc.) lend themselves well to situations where you have multiple input files and where data needs to be compared between the different data sources?

In the past I've written a few MapReduce jobs using Hadoop and Pig. However, these tasks were quite simple since they involved manipulating only a single dataset. The requirements we have now dictate that we read data from multiple sources and compare various data elements against another data source. We then report on the differences. The datasets we are working with are in the region of 10 million - 60 million records, and so far we haven't managed to make these jobs fast enough.

Is there a case for using MapReduce to solve such issues, or am I going down the wrong route?

Any suggestions are much appreciated.

Are the datasets pre-sorted and partitioned? How are the datasets compared (key in the records, or more complex)? - Chris White
The datasets are coming from a third party, so I can't guarantee sorting order. Basically, I have to match address fields from these sources against a "master" source that we host, and based on matches we do certain things. Comparison operations on address fields involve fairly complex string-matching logic. - swedstar

4 Answers

0 votes

I guess I'd preprocess the different datasets into a common format (being sure to include a "data source" id column with the same unique value for every row coming from the same dataset). Then move the files into the same directory, load the whole directory, and treat it as a single data source in which you compare the properties of rows based on their dataset id.
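
To make that concrete, here is a minimal sketch of such a tagging step as a Hadoop mapper; the class name, the tab-separated layout and the choice of the first field as the comparison key are all assumptions for illustration:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tags each record with an id derived from its input file, so that one job
// can compare rows from different sources that share the same key.
public class SourceTaggingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String sourceId;

    @Override
    protected void setup(Context context) {
        // Use the input file name as the "data source" id for this split.
        sourceId = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", -1);
        // Assume the first field is the comparison key; emit (key, tagged record)
        // so the reducer sees all rows with the same key, from every source, together.
        context.write(new Text(fields[0]), new Text(sourceId + "\t" + line.toString()));
    }
}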

0 votes

Yes, you can join multiple datasets in a MapReduce job. I would recommend getting a copy of the book/ebook Hadoop in Action, which addresses joining data from multiple sources.

0 votes

When you have multiple input files, you can use the MapReduce API FileInputFormat.addInputPaths(), which can take a comma-separated list of multiple files, as below:

FileInputFormat.addInputPaths(job, "dir1/file1,dir2/file2,dir3/file3");

You can also pass multiple inputs into a Mapper in Hadoop using the DistributedCache; more info is described here: multiple input into a Mapper in hadoop
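
For the distributed cache route, here is a rough sketch assuming the newer org.apache.hadoop.mapreduce API; the file name master.txt and the tab-separated layout are placeholders for illustration only:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Loads a small "master" file from the distributed cache in setup() and
// compares every incoming record against it in map().
public class CacheLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> master = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Files added with job.addCacheFile(...) are localised next to the task;
        // "master.txt" is the symlink name chosen when the file was added.
        try (BufferedReader reader = new BufferedReader(new FileReader("master.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                master.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String key = line.toString().split("\t", 2)[0];
        if (master.containsKey(key)) {
            // The record matched the master source; emit whatever comparison result you need.
            context.write(new Text(key), line);
        }
    }
}

On the driver side this assumes the file was registered with something like job.addCacheFile(new URI("/data/master.txt#master.txt")), so the task sees it locally as master.txt.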

If I am not misunderstanding, you are trying to normalize the structured data in records coming in from several inputs and then process it. Based on this, I think you really need to look at this article, which helped me in the past. It covers How To Normalize Data Using Hadoop/MapReduce, as below:

  • Step 1: Extract the column-value pairs from the original data (a sketch of this step follows the list).
  • Step 2: Extract the column-value pairs not in the master ID file.
  • Step 3: Calculate the maximum ID for each column in the master file.
  • Step 4: Calculate a new ID for the unmatched values.
  • Step 5: Merge the new IDs with the existing master IDs.
  • Step 6: Replace the values in the original data with IDs.
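
As a sketch of step 1 only (the column names and the tab-separated layout are assumptions for illustration, not taken from the article), a mapper could emit one column-value pair per field:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Step 1 sketch: turn each record into (column=value) pairs so later steps
// can look the values up against the master ID file and assign new IDs.
public class ColumnValueMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final String[] COLUMNS = {"street", "city", "postcode"};

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", -1);
        for (int i = 0; i < COLUMNS.length && i < fields.length; i++) {
            context.write(new Text(COLUMNS[i] + "=" + fields[i]), new Text(""));
        }
    }
}
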
0 votes

Using MultipleInputs, we can do this:

MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, path2, TextInputFormat.class, Mapper2.class);
job.setReducerClass(Reducer1.class);
// set the output path here with FileOutputFormat.setOutputPath(job, outputPath);

If both mappers emit a common key, the records can be joined in the reducer, where you can apply the necessary logic.
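
As a minimal sketch of what Reducer1 could look like, assuming Mapper1 and Mapper2 (not shown above) prefix their output values with source tags such as "A|" and "B|":

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Separates the values for each key by their source tag, then pairs them up
// so the comparison / difference-reporting logic can run on matched records.
public class Reducer1 extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> fromA = new ArrayList<>();
        List<String> fromB = new ArrayList<>();
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("A|")) {
                fromA.add(v.substring(2));
            } else {
                fromB.add(v.substring(2));
            }
        }
        // Emit every pairing for this key; replace this with the actual comparison logic.
        for (String a : fromA) {
            for (String b : fromB) {
                context.write(key, new Text(a + "\t" + b));
            }
        }
    }
}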