0 votes

I am executing the job as:

hadoop/bin/./hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -D mapred.reduce.tasks=2 \
    -file kmeans_mapper.py -mapper kmeans_mapper.py \
    -file kmeans_reducer.py -reducer kmeans_reducer.py \
    -input gutenberg/small_train.csv -output gutenberg/out

When the two reducers are done, I would like to do something with the results, so ideally I would like to run another script (another mapper?) that would receive the output of the reducers as its input. How can I do that easily?

I checked this blog, which has an mrjob example, but it doesn't really explain the mechanism; I do not see how to do it for my case.

The MapReduce tutorial states:

Users may need to chain MapReduce jobs to accomplish complex tasks which cannot be done via a single MapReduce job. This is fairly easy since the output of the job typically goes to distributed file-system, and the output, in turn, can be used as the input for the next job.

but it doesn't give any example...
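
If I understand that correctly, once my current job finishes I would just submit a second streaming job whose -input is the first job's -output, something like the sketch below? (kmeans_second_mapper.py is only a placeholder name for the follow-up script I still have to write, and -D mapred.reduce.tasks=0 should make that second step map-only.)

hadoop/bin/./hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -D mapred.reduce.tasks=0 \
    -file kmeans_second_mapper.py -mapper kmeans_second_mapper.py \
    -input gutenberg/out -output gutenberg/out2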

Here is some Java code that I could understand, but I am writing in Python! :/


This question sheds some light: Chaining multiple mapreduce tasks in Hadoop streaming

You specified the output to an hdfs directory, right? You'd need to do another mapreduce job with that as your input - OneCricketeer
Yes @cricket_007, I am not sure how to do that with one call. I mean, I could execute the job as above and then execute another job, which would invoke only a mapper. But that seems bizarre; can't I do it by pressing the ENTER key once? :) - gsamaras
I'm pretty sure the tutorial you are reading (which is over 2 years old for the page you linked to) is saying the Java MapReduce API allows stringing jobs together. Streaming MapReduce only goes through standard in and standard out. You may be able to pipe the commands through each other, but only if you output to standard out - OneCricketeer
@cricket_007 yes, I am using sys.stdin for input and print for output. - gsamaras
Right, in your Python code, because that is how the streaming API works, but it gets bundled in a jar file and sent to HDFS. You won't be able to get the standard output of the job back to your local terminal to pipe it into a new command; that is why the output is written to HDFS - OneCricketeer

1 Answer

1 vote

It is possible to do what you're asking for with the Java API, as in the example you found.

But you are using the streaming API, which simply reads standard in and writes to standard out. There is no callback to say when a MapReduce job has completed other than the completion of the hadoop jar command, and the fact that the command completed doesn't really indicate "success". That being said, it really isn't possible without some more tooling around the streaming API.

If the output were written to the local terminal rather than to HDFS, it might be possible to pipe that output into the input of another streaming job, but unfortunately the inputs and outputs of the streaming jar must be paths on HDFS.
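
What you can do from the shell is launch the second streaming job right after the first one and let the exit status of the first hadoop jar command gate the second, so a single ENTER press still starts the whole chain. A rough sketch, assuming the hypothetical follow-up mapper from your question, kmeans_second_mapper.py, and a fresh output directory gutenberg/out2:

hadoop/bin/./hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -D mapred.reduce.tasks=2 \
    -file kmeans_mapper.py -mapper kmeans_mapper.py \
    -file kmeans_reducer.py -reducer kmeans_reducer.py \
    -input gutenberg/small_train.csv -output gutenberg/out \
&& hadoop/bin/./hadoop jar /home/hadoopuser/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -D mapred.reduce.tasks=0 \
    -file kmeans_second_mapper.py -mapper kmeans_second_mapper.py \
    -input gutenberg/out -output gutenberg/out2

Keep in mind that && only checks the exit status of the first hadoop jar command, which, as noted above, is not a strong guarantee that the job really succeeded, and the second job still reads its input from HDFS rather than from a pipe.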