10
votes

For a Python Hadoop streaming job, how do I pass a parameter to, for example, the reducer script so that it behaves differently based on the parameter being passed in?

I understand that streaming jobs are invoked in the format:

hadoop jar hadoop-streaming.jar -input <input> -output <output> -mapper mapper.py -reducer reducer.py ...

I want to affect reducer.py.

4 Answers

18
votes

The argument to the command-line option -reducer can be any command, so you can try:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input inputDirs \
    -output outputDir \
    -mapper myMapper.py \
    -reducer 'myReducer.py 1 2 3' \
    -file myMapper.py \
    -file myReducer.py

assuming myReducer.py is made executable. Disclaimer: I have not tried it, but I have passed similar complex strings to -mapper and -reducer before.
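
For example, the reducer can read those extra tokens from sys.argv. A minimal sketch (the argument names and the tab-separated key/value handling are illustrative assumptions, not from the original answer):

#!/usr/bin/env python
import sys

# Positional arguments passed via -reducer 'myReducer.py 1 2 3'
arg1, arg2, arg3 = sys.argv[1:4]

for line in sys.stdin:
    # Hadoop streaming feeds tab-separated key/value lines on stdin
    key, _, value = line.rstrip('\n').partition('\t')
    # ... branch on arg1/arg2/arg3 to change the reducer's behavior ...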

That said, have you tried the

-cmdenv name=value

option, and just have your Python reducer get its value from the environment? It's just another way to do things.

2
votes

In your Python code,

import os
# ...
param_opt = os.environ["PARAM_OPT"]

In your Hadoop command, include:

hadoop jar \
(...)
-cmdenv PARAM_OPT=value \
(...)
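
If PARAM_OPT might be unset (for example, when testing the script locally outside Hadoop), os.environ.get avoids a KeyError. A small variant, with a hypothetical fallback value:

import os

param_opt = os.environ.get("PARAM_OPT", "default-value")
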
2
votes

You can pass arguments to -reducer as in the command below:

hadoop jar hadoop-streaming.jar \
-mapper 'count_mapper.py arg1 arg2' -file count_mapper.py \
-reducer 'count_reducer.py arg3' -file count_reducer.py

You can refer to this link for more details.

1
vote

If you are using Python, you may want to check out dumbo, which provides a nice wrapper around Hadoop streaming. In dumbo you pass parameters with -param, as in:

dumbo start yourpython.py -hadoop <hadoop-path> -input <input> -output <output>  -param <parameter>=<value>

Then read it in the reducer:

class Reducer:
    def __init__(self):
        # dumbo exposes -param values through self.params
        self.parameter = int(self.params["<parameter>"])
    def __call__(self, key, values):
        # do something interesting ...
        pass

You can read more in the dumbo tutorial.