0 votes

Related to my question: I have a Hadoop streaming process written in Python.

I notice that each reducer receives the values associated with multiple keys through sys.stdin.

I would prefer sys.stdin to contain only the values associated with a single key. Is this possible with Hadoop? I figure a separate reducer process per key would be ideal, but I can't find a configuration that produces this behavior.

Can someone assist me with information or code that can aid me with this?
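Even without one process per key, a streaming reducer can recover per-key grouping itself: Hadoop sorts reducer input by key, so all lines for one key arrive contiguously and `itertools.groupby` can split them. A minimal sketch (the counting reduction is just a stand-in for your real per-key logic):

```python
#!/usr/bin/env python
import io
import sys
from itertools import groupby


def parse(stream):
    """Yield (key, value) pairs from tab-separated streaming input."""
    for line in stream:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Streaming input arrives sorted by key, so groupby sees each
    # key exactly once, as one contiguous run of lines.
    for key, pairs in groupby(parse(stdin), key=lambda kv: kv[0]):
        values = [v for _, v in pairs]
        # Stand-in reduction: emit the number of values per key.
        stdout.write("%s\t%d\n" % (key, len(values)))


if __name__ == "__main__":
    main()
```

This gives each key its own logical scope inside the reducer, which is usually what the one-process-per-key wish is really after.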

Comment (2 votes): Why do you want to restrict each reducer to a single key? – highlycaffeinated

2 Answers

1 vote

Each mapper must know the total number of reducers because it partitions its output into one chunk per reducer. If you know the number of keys before starting the job, you can configure the job with that many reducers. Otherwise you're out of luck: the total number of distinct keys isn't known until the mappers have finished.
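This is the routing step the answer describes: each record goes to reducer `hash(key) % numReducers`, which is why the reducer count must be fixed before the mappers run. A rough Python illustration of how a default hash partitioner routes keys (zlib.crc32 stands in for Hadoop's actual Java hashCode, so the exact assignments differ, but the principle is the same):

```python
import zlib


def partition(key, num_reducers):
    """Route a key to a reducer index via a stable hash.

    zlib.crc32 is a stand-in for Hadoop's hashCode-based
    HashPartitioner; the routing rule hash(key) % numReducers
    is the part being illustrated.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Every record with the same key lands on the same reducer,
# but one reducer generally serves many keys.
keys = ["apple", "banana", "cherry", "date"]
assignments = {k: partition(k, 2) for k in keys}
```

With only two reducers and four keys, at least two keys must share a reducer, which is exactly why one-key-per-reducer cannot be guaranteed in general.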

0 votes

Yes, if you know the total number of keys the mappers will emit, you can set that many reducers with job.setNumReduceTasks(int n).

Also, the number of reducers that run in parallel on each TaskTracker can be configured in mapred-site.xml via

mapred.tasktracker.reduce.tasks.maximum

This can speed up the reduce phase. However, each reducer runs as a separate JVM task, so your nodes must have enough resources for the number of JVMs that will be spawned.
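For reference, the property mentioned above would be set in mapred-site.xml roughly like this (the value 4 is only an example; tune it to the cores and memory available on each TaskTracker node):

```xml
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
  <description>Maximum number of reduce tasks run simultaneously
  by one TaskTracker (one JVM per task).</description>
</property>
```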