0 votes

Related to my question: I have a Hadoop streaming process written in Python.

I notice that each reducer receives the values associated with multiple keys through sys.stdin.

I would prefer sys.stdin to contain only the values associated with a single key. Is this possible with Hadoop? I figure a separate reducer process per key would be ideal, but I can't find a configuration that produces this behavior.

Can someone assist me with information or code that can aid me with this?
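Even without one process per key, a streaming reducer can recover per-key grouping itself: Hadoop sorts reducer input by key, so all lines for one key arrive contiguously and `itertools.groupby` can split them. A minimal sketch (the counting reduction is just a stand-in for your real per-key logic):

```python
#!/usr/bin/env python
import io
import sys
from itertools import groupby


def parse(stream):
    """Yield (key, value) pairs from tab-separated streaming input."""
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Streaming input arrives sorted by key, so groupby sees each
    # key exactly once, as one contiguous run of lines.
    for key, pairs in groupby(parse(stdin), key=lambda kv: kv[0]):
        values = [v for _, v in pairs]
        # Stand-in reduction: emit the number of values per key.
        stdout.write("%s\t%d\n" % (key, len(values)))


if __name__ == "__main__":
    main()
```

This gives each key its own logical scope inside the reducer, which is usually what the one-process-per-key wish is really after.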

Comment (2 votes): Why do you want to restrict each reducer to a single key? – highlycaffeinated

2 Answers

1 vote

Each mapper must know the total number of reducers because it partitions its output into one chunk per reducer. If you know the number of keys before starting the job, you can configure the job with that many reducers. Otherwise you're out of luck: the total number of distinct keys isn't known until the mappers have finished.
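This is the routing step the answer describes: each record goes to reducer `hash(key) % numReducers`, which is why the reducer count must be fixed before the mappers run. A rough Python illustration of how a default hash partitioner routes keys (zlib.crc32 stands in for Hadoop's actual Java hashCode, so the exact assignments differ, but the principle is the same):

```python
import zlib


def partition(key, num_reducers):
    """Route a key to a reducer index via a stable hash.

    zlib.crc32 is a stand-in for Hadoop's hashCode-based
    HashPartitioner; the routing rule hash(key) % numReducers
    is the part being illustrated.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Every record with the same key lands on the same reducer,
# but one reducer generally serves many keys.
keys = ["apple", "banana", "cherry", "date"]
assignments = {k: partition(k, 2) for k in keys}
```

With only two reducers and four keys, at least two keys must share a reducer, which is exactly why one-key-per-reducer cannot be guaranteed in general.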

0 votes

Yes, if you know the total number of keys the mappers will emit, you can set that many reducers with job.setNumReduceTasks(int n).

Also, the number of reducers that run in parallel on each TaskTracker can be configured in mapred-site.xml via

mapred.tasktracker.reduce.tasks.maximum

This can speed up the reduce phase. However, each reducer runs as a separate JVM task, so your nodes must have enough resources for the number of JVMs that will be spawned.
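For reference, the property mentioned above would be set in mapred-site.xml roughly like this (the value 4 is only an example; tune it to the cores and memory available on each TaskTracker node):

```xml
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
  <description>Maximum number of reduce tasks run simultaneously
  by one TaskTracker (one JVM per task).</description>
</property>
```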