1 vote

Do MapReduce and any of the other Hadoop technologies (HBase, Hive, Pig, etc.) lend themselves well to situations where you have multiple input files and where data needs to be compared between the different data sources?

In the past I've written a few MapReduce jobs using Hadoop and Pig. However, these tasks were quite simple since they involved manipulating only a single dataset. The requirements we have now dictate that we read data from multiple sources and compare various data elements against another data source. We then report on the differences. The datasets we are working with are in the region of 10 million - 60 million records, and so far we haven't managed to make these jobs fast enough.

Is there a case for using MapReduce to solve such issues, or am I going down the wrong route?

Any suggestions are much appreciated.

Are the datasets pre-sorted and partitioned? How are the datasets compared (key in the records, or more complex)? - Chris White
The datasets are coming from a third party, so I can't guarantee sorting order. Basically, I have to match address fields from these sources against a "master" source that we host, and based on matches we do certain things. Comparison operations on address fields involve fairly complex string-matching logic. - swedstar

4 Answers

0 votes

I guess I'd preprocess the different datasets into a common format (being sure to include a "data source" id column with the same unique value for every row coming from the same dataset). Then move the files into the same directory, load the whole directory, and treat it as a single data source in which you compare the properties of rows based on their dataset id.
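
To make that concrete, here is a minimal sketch of such a tagging step as a Hadoop mapper; the class name, the tab-separated layout and the choice of the first field as the comparison key are all assumptions for illustration:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Tags each record with an id derived from its input file, so that one job
// can compare rows from different sources that share the same key.
public class SourceTaggingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String sourceId;

    @Override
    protected void setup(Context context) {
        // Use the input file name as the "data source" id for this split.
        sourceId = ((FileSplit) context.getInputSplit()).getPath().getName();
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", -1);
        // Assume the first field is the comparison key; emit (key, tagged record)
        // so the reducer sees all rows with the same key, from every source, together.
        context.write(new Text(fields[0]), new Text(sourceId + "\t" + line.toString()));
    }
}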

0 votes

Yes, you can join multiple datasets in a MapReduce job. I would recommend getting a copy of the book/ebook Hadoop in Action, which addresses joining data from multiple sources.

0 votes

When you have multiple input files, you can use the MapReduce API FileInputFormat.addInputPaths(), which can take a comma-separated list of multiple files, as below:

FileInputFormat.addInputPaths(job, "dir1/file1,dir2/file2,dir3/file3");

You can also pass multiple inputs into a Mapper in Hadoop using the DistributedCache; more info is described here: multiple input into a Mapper in hadoop
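
For the distributed cache route, here is a rough sketch assuming the newer org.apache.hadoop.mapreduce API; the file name master.txt and the tab-separated layout are placeholders for illustration only:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Loads a small "master" file from the distributed cache in setup() and
// compares every incoming record against it in map().
public class CacheLookupMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Map<String, String> master = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // Files added with job.addCacheFile(...) are localised next to the task;
        // "master.txt" is the symlink name chosen when the file was added.
        try (BufferedReader reader = new BufferedReader(new FileReader("master.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t", 2);
                master.put(parts[0], parts.length > 1 ? parts[1] : "");
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String key = line.toString().split("\t", 2)[0];
        if (master.containsKey(key)) {
            // The record matched the master source; emit whatever comparison result you need.
            context.write(new Text(key), line);
        }
    }
}

On the driver side this assumes the file was registered with something like job.addCacheFile(new URI("/data/master.txt#master.txt")), so the task sees it locally as master.txt.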

If I am not misunderstanding, you are trying to normalize the structured data in records coming in from several inputs and then process it. Based on this, I think you really need to look at this article, which helped me in the past. It covers How To Normalize Data Using Hadoop/MapReduce, as below:

  • Step 1: Extract the column-value pairs from the original data (a sketch of this step follows the list).
  • Step 2: Extract the column-value pairs not in the master ID file.
  • Step 3: Calculate the maximum ID for each column in the master file.
  • Step 4: Calculate a new ID for the unmatched values.
  • Step 5: Merge the new IDs with the existing master IDs.
  • Step 6: Replace the values in the original data with IDs.
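
As a sketch of step 1 only (the column names and the tab-separated layout are assumptions for illustration, not taken from the article), a mapper could emit one column-value pair per field:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Step 1 sketch: turn each record into (column=value) pairs so later steps
// can look the values up against the master ID file and assign new IDs.
public class ColumnValueMapper extends Mapper<LongWritable, Text, Text, Text> {

    private static final String[] COLUMNS = {"street", "city", "postcode"};

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split("\t", -1);
        for (int i = 0; i < COLUMNS.length && i < fields.length; i++) {
            context.write(new Text(COLUMNS[i] + "=" + fields[i]), new Text(""));
        }
    }
}
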
0 votes

Using MultipleInputs, we can do this:

MultipleInputs.addInputPath(job, path1, TextInputFormat.class, Mapper1.class);
MultipleInputs.addInputPath(job, path2, TextInputFormat.class, Mapper2.class);
job.setReducerClass(Reducer1.class);
// set the output path here with FileOutputFormat.setOutputPath(job, outputPath);

If both mappers emit a common key, the records can be joined in the reducer, where you can apply the necessary logic.
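
As a minimal sketch of what Reducer1 could look like, assuming Mapper1 and Mapper2 (not shown above) prefix their output values with source tags such as "A|" and "B|":

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Separates the values for each key by their source tag, then pairs them up
// so the comparison / difference-reporting logic can run on matched records.
public class Reducer1 extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        List<String> fromA = new ArrayList<>();
        List<String> fromB = new ArrayList<>();
        for (Text value : values) {
            String v = value.toString();
            if (v.startsWith("A|")) {
                fromA.add(v.substring(2));
            } else {
                fromB.add(v.substring(2));
            }
        }
        // Emit every pairing for this key; replace this with the actual comparison logic.
        for (String a : fromA) {
            for (String b : fromB) {
                context.write(key, new Text(a + "\t" + b));
            }
        }
    }
}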