I am trying to write a data-join MapReduce job in Hadoop. I feel I am close, but I am having an issue preventing map1 from feeding into map2.
I have two mappers and a single reducer, and I am trying to force Map1 to read from one file while forcing Map2 to read from another. I would then like to parse the results in the reducer to format the join output.
I know that by default, when chaining mappers in a job, the output of one mapper becomes the input of the next. I know this can be overridden, but I have not been successful: I have confirmed that the data from map1 is still feeding into map2.
This is how I thought I was supposed to specify the input path of a single mapper:
// Setting configuration for map2
JobConf map2 = new JobConf(false);
String[] map2Args = new GenericOptionsParser(map2, args).getRemainingArgs();
FileInputFormat.setInputPaths(map2, new Path(map2Args[1]));
ChainMapper.addMapper(conf,
                      Map2.class,
                      LongWritable.class, // input key class
                      Text.class,         // input value class
                      Text.class,         // output key class
                      Text.class,         // output value class
                      true,               // byValue
                      map2);
conf is the main job configuration, and args consists of 3 values: the first two are input files and the third is the intended output file.
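For reference, this is a minimal sketch (plain Java, no Hadoop types, so it runs standalone) of the reducer-side parsing I have in mind: each mapper would prefix its output value with a source tag, and the reducer splits the tagged values by source and emits the joined records. The "A:"/"B:" tags and the tab-separated output format are just my own convention for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JoinSketch {
    // Each mapper prefixes its output value with a tag identifying
    // which input file the record came from.
    static String tag(String source, String value) {
        return source + ":" + value;
    }

    // Reducer side: separate the tagged values for one key by source,
    // then emit the cross product, i.e. the inner-join output.
    static List<String> join(String key, List<String> taggedValues) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String v : taggedValues) {
            if (v.startsWith("A:")) left.add(v.substring(2));
            else if (v.startsWith("B:")) right.add(v.substring(2));
        }
        List<String> out = new ArrayList<>();
        for (String l : left)
            for (String r : right)
                out.add(key + "\t" + l + "\t" + r);
        return out;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList(tag("A", "alice"), tag("B", "engineering"));
        System.out.println(join("42", values));
    }
}
```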
What is the correct way to specify an input path for an individual mapper, other than the first, when dealing with data joins and multiple mappers in Hadoop?