Correlating input files to output files

Question

I have a MapReduce streaming job. My code is in C++. It's a mapper-only job, with no reducer. The input to the job is a directory containing three files. The job creates three mappers. Each mapper processes one input file and produces one output file in a different format.

Input files are like:

MyDir/file1
MyDir/file2
MyDir/file3

Output files are like:

MyDir/Output/part-00000
MyDir/Output/part-00001
MyDir/Output/part-00002

I want to correlate input files to output files. For example, input file MyDir/file1 may correspond to output file MyDir/Output/part-00002, i.e. the mapper that processed input file MyDir/file1 may have produced output file MyDir/Output/part-00002.

I want to know this relationship, i.e., which input file corresponds to which output file. Is there a simple way to know this?

Praveen Sripati · Accepted Answer · 2012-01-24T05:00:06

One way I can think of is to give the job's output files the same names as its input files. Get the name of the input file that the mapper is processing (the map.input.file job property, which Streaming passes to the mapper as the environment variable map_input_file, with dots replaced by underscores) and then use it in the MultipleOutputFormat#generateFileNameForKeyValue method.
