
I have an MR streaming job. My code is in C++. It's a mapper-only job, with no reducer. Input to the job is a directory containing three files. The job creates 3 mappers. Each mapper processes one input file and produces one output file in a different format.

Input files are like:

MyDir/file1
MyDir/file2
MyDir/file3

Output files are like:

MyDir/Output/part-00000
MyDir/Output/part-00001
MyDir/Output/part-00002

I want to correlate input files to output files. For example, input file MyDir/file1 may correspond to output file MyDir/Output/part-00002, i.e. the mapper that processed input file MyDir/file1 may have produced output file MyDir/Output/part-00002.

I want to know this relationship, i.e., which input file corresponds to which output file. Is there a simple way to know this?


2 Answers

0
votes

One way I can think of is to give the job's output files the same names as its input files. Get the name of the input file that the mapper is processing (the map.input.file property, which a streaming job exposes to the mapper as the environment variable map_input_file) and then use it in the MultipleOutputFormat#generateFileNameForKeyValue method.
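If renaming the outputs through a Java output format is more than you need, a lighter variation is to have each C++ mapper record the pairing itself. The sketch below is only an illustration under some assumptions: that the streaming framework exports job properties as environment variables with dots turned into underscores (map_input_file and mapred_task_partition on older releases, mapreduce_map_input_file and mapreduce_task_partition on newer ones), and that the default part-NNNNN output name is derived from the task's partition number. The helper names first_env, part_file_name and correlation_record are made up for this example.

```cpp
#include <cstdio>
#include <cstdlib>
#include <initializer_list>
#include <string>

// Return the value of the first environment variable in `names` that is set,
// or an empty string if none of them is set.
std::string first_env(std::initializer_list<const char*> names) {
    for (const char* name : names) {
        if (const char* value = std::getenv(name)) return value;
    }
    return "";
}

// Format a partition number as a default-style output file name.
// Assumption: the job uses the default "part-" prefix with five digits.
std::string part_file_name(int partition) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "part-%05d", partition);
    return buf;
}

// Build an "input<TAB>output" record for the current task.
// Returns an empty string if either piece of information is unavailable.
std::string correlation_record() {
    // Newer property names are tried first, then the older ones.
    std::string input =
        first_env({"mapreduce_map_input_file", "map_input_file"});
    std::string partition =
        first_env({"mapreduce_task_partition", "mapred_task_partition"});
    if (input.empty() || partition.empty()) return "";
    return input + "\t" + part_file_name(std::atoi(partition.c_str()));
}
```

In the mapper's main you could write correlation_record() to std::cerr once before reading stdin, so the pairing ends up in each task's stderr log without touching the job output. Unlike the approach above, this does not rename anything; it only reports which input went to which part file.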

0
votes

Given how Hadoop is designed, the only relationship you can rely on, unless you explicitly name the output files as in the other answer, is that the number of output files corresponds to the number of final tasks being run, usually reducers (mappers in your case, since you're not running any reducers).

If Hadoop later decides to run more mappers/reducers instead of just 3 (larger input files, more nodes available), you'll get a different number of output files.