
I'm writing Hadoop streaming jobs in R and I've run into a situation I can't find documented anywhere. I'd like to run a reduce job (no mapper required) whose output passes directly to another mapper. Is it possible to stack a map job directly after a reduce job without an initial mapper? If I write an identity mapper to pass output to my reduce job, can I then pass the reduce output to another mapper, and if so, how? My current code is:

$HADOOP_HOME/bin/hadoop jar /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar \
  -reducer myreducer.r \
  -input myinput/ \
  -output myoutputdir \
  -file myreducer.r \
  -file file1.r \
  -file file2.Rdata

And this is not working.


1 Answer


I'll answer your question and then give my suggestion.

You cannot send reduce output directly to a mapper. It's always map, then reduce. Just the way it works. However, you can have two MapReduce jobs. Have the reducer write out to HDFS, then start a second map-only job that reads the output data of the first job.
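
For example, here's a rough sketch of the two-job version, reusing the jar path and file names from your question; mymapper.r (your second-stage script) and the intermediate/ handoff directory are placeholder names:

# Job 1: identity mapper (cat), then your reducer; results land in HDFS
$HADOOP_HOME/bin/hadoop jar /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar \
  -mapper cat \
  -reducer myreducer.r \
  -input myinput/ \
  -output intermediate/ \
  -file myreducer.r \
  -file file1.r \
  -file file2.Rdata

# Job 2: map-only (zero reduce tasks) over the first job's output
$HADOOP_HOME/bin/hadoop jar /opt/cloudera/parcels/CDH-5.3.2-1.cdh5.3.2.p0.10/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar \
  -D mapred.reduce.tasks=0 \
  -mapper mymapper.r \
  -input intermediate/ \
  -output myoutputdir \
  -file mymapper.r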

In general, if you want to do a map after a reduce, you can almost always fold the map step into the reducer itself. Think about it: if you are mapping every output record from a reducer, why not just run that "map" code at the end of the reducer? This is far more efficient than running two MapReduce jobs. If you really don't want to write a new R script to do this, you can wrap the two scripts in a bash script so that Hadoop sees them as one, as sketched below.
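
A minimal sketch of that wrapper, assuming both R scripts have Rscript shebangs and read stdin/write stdout as streaming scripts normally do (mywrapper.sh and mymapper.r are placeholder names):

#!/usr/bin/env bash
# mywrapper.sh: run the real reducer, then pipe its records
# through the extra "map" step, so Hadoop sees a single reducer
./myreducer.r | ./mymapper.r

Then pass -reducer mywrapper.sh to streaming and ship all three scripts with -file mywrapper.sh -file myreducer.r -file mymapper.r. Files shipped with -file land in each task's working directory, which is why the ./ paths resolve.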