0
votes

How do mapper/reducer instances get re-used within a JVM that's kept alive perpetually?

For example, let's say I wanted to do something like this:

public class MyMapper extends MapReduceBase implements Mapper<K1, V1, K2, V2> {

    private Set<String> set = new HashSet<String>();

    public void map(K1 k1, V1 v1, OutputCollector<K2, V2> output, Reporter reporter) {
        // ... do stuff ...

        set.add(k1.toString()); // add something to the set so that it can be used later

        // ... do other stuff ...

        if(set.contains("someString"))
            emitSomeKindOfOutput(output);
        else
            emitSomeOtherKindOfOutput(output);
    }

}

If the same mapper instance can be used for multiple tasks/jobs, then the member set could cause problems, because it would still contain junk from previous tasks/jobs. Is this kind of re-use possible in Hadoop? What about for reducers?

2 Answers

2
votes

You are definitely safe. Mapper and reducer instances are not reused: each task gets a new instance. If you need to perform some initialization or cleanup, you can override the two methods configure and close provided by MapReduceBase. Your code sample does not need either.

If set were a static variable, you would have to clear it in the close() method to be safe, even though most site configurations do not require it: by default a new JVM is forked for each task, and you have to set mapred.job.reuse.jvm.num.tasks to enable JVM reuse. Even with reuse enabled, two map tasks are never run concurrently in the same JVM.
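To make the instance-versus-static distinction concrete, here is a minimal sketch that does not use Hadoop at all. It only simulates a reused JVM running two tasks one after the other, each with a fresh mapper object. The names SketchMapper, JvmReuseSketch, SHARED and runTwoTasks are hypothetical and chosen for illustration; the close() method here only stands in for the role that close() plays in MapReduceBase.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a mapper with per-instance and static state. */
class SketchMapper {
    // Per-instance state: goes away with the instance when the task ends.
    private final Set<String> seen = new HashSet<>();

    // Static state: lives as long as the JVM, so it can outlive a single
    // task when JVM reuse is enabled.
    static final Set<String> SHARED = new HashSet<>();

    private final boolean clearStaticOnClose;

    SketchMapper(boolean clearStaticOnClose) {
        this.clearStaticOnClose = clearStaticOnClose;
    }

    void map(String key) {
        seen.add(key);
        SHARED.add(key);
    }

    boolean instanceHas(String key) {
        return seen.contains(key);
    }

    // Plays the role of close(): the last chance to reset JVM-wide state.
    void close() {
        if (clearStaticOnClose) {
            SHARED.clear();
        }
    }
}

class JvmReuseSketch {
    /**
     * Simulates two sequential tasks in one JVM, each with a fresh mapper.
     * Returns what the second task observes before it maps anything:
     * [0] = instance field contains the old key, [1] = static field does.
     */
    static boolean[] runTwoTasks(boolean clearStaticOnClose) {
        SketchMapper.SHARED.clear(); // start from a "freshly forked" JVM

        SketchMapper first = new SketchMapper(clearStaticOnClose);
        first.map("someString");
        first.close();

        SketchMapper second = new SketchMapper(clearStaticOnClose);
        return new boolean[] {
            second.instanceHas("someString"),
            SketchMapper.SHARED.contains("someString")
        };
    }

    public static void main(String[] args) {
        boolean[] leaky = runTwoTasks(false);
        boolean[] clean = runTwoTasks(true);
        System.out.println("instance field leaked: " + leaky[0]);
        System.out.println("static field leaked without clear: " + leaky[1]);
        System.out.println("static field leaked with clear: " + clean[1]);
    }
}
```

In this simulation the instance field never carries over to the second task, while the static field does unless close() clears it. That is the same reasoning as above: a non-static member like the one in the question is safe, and only static state needs explicit cleanup.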

0
votes

As far as I know, Hadoop is based on a shared-nothing architecture, so your 'private Set set' variable won't be shared among different mappers. There shouldn't be any question of getting, as you put it, junk from previous tasks/jobs.