
I want to pipe my Hadoop streaming jobs. For example, I ran this command:

hadoop jar hadoop-streaming.jar -mapper map1.py -reducer reducer.py -input xx -output /output1

But I want to use the output of step one as the input to step two of my MapReduce job without storing it in HDFS, maybe emitting it on stdout instead. Is there something like a Linux pipe? For example:

hadoop jar hadoop-streaming.jar -mapper map1.py -reducer reducer.py -input xx | hadoop jar hadoop-streaming.jar -mapper map2.py -reducer reducer2.py -output /output
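Because streaming mappers and reducers just read stdin and write stdout, the pipeline can at least be rehearsed locally with real Unix pipes (`sort` standing in for the shuffle phase). A rough sketch, with awk one-liners as hypothetical stand-ins for map1.py / reducer.py / map2.py:

```shell
# Local rehearsal of a two-step streaming pipeline using ordinary pipes.
# map1 emits "word\t1", sort groups keys, the reducer sums counts,
# and map2 keeps only words that occurred at least 3 times.
result=$(
  printf 'apple banana apple\nbanana apple\n' |
  awk '{ for (i = 1; i <= NF; i++) print $i "\t1" }' |                    # map1
  sort |                                                                  # shuffle
  awk -F'\t' '{ c[$1] += $2 } END { for (w in c) print w "\t" c[w] }' |   # reducer
  sort |
  awk -F'\t' '$2 >= 3 { print $1 }'                                       # map2
)
echo "$result"   # prints: apple
```

This only tests the scripts' logic on one machine; a real cluster run still has to go through HDFS between jobs.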


1 Answer


I had the same problem and ended up using a shell script to run the Hadoop streaming command. I created a file called hadoop.sh that contained the following:

#!/bin/bash
# remove the previous run's output, then run the streaming job
# (these are sequential steps, not a pipe, so don't join them with "|")
rm -r output
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -files /hadoop-2.7.3/script/mapper.php \
    -input /data/* -output output \
    -mapper "php mapper.php" \
    -jobconf mapred.reduce.tasks=1
# add a beginning/ending php tag to the file
ex -sc '1i|<?php' -c '$a|?>' -cx output/part-00000
# move the file from /output to /script
mv /hadoop-2.7.3/output/part-00000 /hadoop-2.7.3/script/part-00000.php

The part-00000 file becomes the part-00000.php file for the next Hadoop command.
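The same script approach extends to the two-job chain in the original question: let the first job write to a temporary HDFS directory and feed that directory to the second job. A sketch only, reusing the jar path above with hypothetical mapper/reducer names; there is no true pipe, so step two starts only after step one finishes:

```shell
# Sketch: chain two streaming jobs through a temporary directory.
# Paths and script names are hypothetical placeholders.
TMP_OUT=tmp-step1-output

bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -mapper map1.py -reducer reducer.py \
    -input xx -output "$TMP_OUT" \
&& bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
    -mapper map2.py -reducer reducer2.py \
    -input "$TMP_OUT" -output /output \
&& bin/hadoop fs -rm -r "$TMP_OUT"   # discard the intermediate data
```

The `&&` makes step two run only if step one succeeded; the intermediate data still passes through HDFS briefly, since streaming jobs cannot hand records directly between processes.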