- Currently, what is the default behavior of the run(Context) method?
The default implementation is visible in the Apache Hadoop source code for the Mapper class:
```java
/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
  } finally {
    cleanup(context);
  }
}
```
To summarize:
- Call `setup` for one-time initialization.
- Iterate through all key-value pairs in the input.
- Pass the key and value to the `map` method implementation.
- Call `cleanup` for one-time teardown.
- If I override run(Context), what kind of special control do I get, as per the documentation?
The default implementation always executes that same sequence in a single thread. Overriding it is rare, but it opens up possibilities for highly specialized implementations, such as different threading models or coalescing redundant key ranges before they reach `map`.
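As one illustration of that "special control", here is a hypothetical, simplified sketch (plain Java, not the real Hadoop API): a `Mapper`-like base class whose `run` method can be overridden to coalesce consecutive records with the same key, so `map` sees one summed record instead of several. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Analogous to Mapper: map() is called once per record by the default run().
class SimpleMapper {
    protected List<String> output = new ArrayList<>();

    // Analogous to Mapper.map(key, value, context).
    protected void map(String key, int value) {
        output.add(key + "=" + value);
    }

    // Analogous to the default Mapper.run(context): one map() call per record.
    public void run(Iterator<String[]> records) {
        while (records.hasNext()) {
            String[] kv = records.next();
            map(kv[0], Integer.parseInt(kv[1]));
        }
    }
}

// Override of run() that sums values across consecutive identical keys,
// so map() receives one coalesced record per key run.
class CoalescingMapper extends SimpleMapper {
    @Override
    public void run(Iterator<String[]> records) {
        String currentKey = null;
        int sum = 0;
        while (records.hasNext()) {
            String[] kv = records.next();
            if (currentKey != null && !currentKey.equals(kv[0])) {
                map(currentKey, sum);  // Flush the finished key run.
                sum = 0;
            }
            currentKey = kv[0];
            sum += Integer.parseInt(kv[1]);
        }
        if (currentKey != null) {
            map(currentKey, sum);  // Flush the final key run.
        }
    }
}
```

For input records `[a,1], [a,2], [b,5]`, the default `run` produces three `map` calls, while the override produces two (`a=3` and `b=5`). The same shape of change is possible with the real `run(Context)`, since the override fully controls the read loop.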
- Has anyone overridden this method in their implementations?
Within the Apache Hadoop codebase, there are two overrides of this method:

ChainMapper allows chaining together multiple Mapper class implementations for execution within a single map task. Its override of run sets up an object representing the chain, and passes each input key/value pair through that chain of mappers.
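The chaining idea can be sketched in plain Java (this is a hypothetical analogue, not the real ChainMapper classes): a driver whose `run`-style loop feeds each record through a list of map stages, where the output of one stage becomes the input of the next.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical analogue of the ChainMapper idea: compose map stages so that
// every record flows through all of them within one "map task".
class MapperChain {
    private final List<Function<String, String>> stages = new ArrayList<>();

    // Append one Mapper-like stage to the chain.
    public MapperChain add(Function<String, String> stage) {
        stages.add(stage);
        return this;
    }

    // Analogous to the overridden run(): pass each input value through
    // every stage in order, collecting the final outputs.
    public List<String> run(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String value : input) {
            for (Function<String, String> stage : stages) {
                value = stage.apply(value);
            }
            out.add(value);
        }
        return out;
    }
}
```

For example, a chain of an uppercasing stage followed by a suffixing stage turns `["a", "b"]` into `["A!", "B!"]`, mirroring how ChainMapper routes one mapper's output into the next mapper's input.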
MultithreadedMapper allows multi-threaded execution of another Mapper class, which must be thread-safe. Its override of run starts multiple threads that iterate the input key-value pairs and pass them through the underlying Mapper.
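The multi-threaded pattern can also be sketched in plain Java (a hypothetical analogue, not the real MultithreadedMapper): several worker threads pull records from a shared, thread-safe source and pass each one through the same thread-safe map function, much as the real override runs several threads over one synchronized record reader.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

// Hypothetical analogue of the MultithreadedMapper idea.
class MultithreadedRunner {
    // Analogous to the overridden run(): start numThreads workers, each
    // iterating the shared input until it is exhausted.
    public static Queue<String> run(List<String> input,
                                    Function<String, String> mapFn,
                                    int numThreads) {
        Queue<String> pending = new ConcurrentLinkedQueue<>(input);
        Queue<String> output = new ConcurrentLinkedQueue<>();
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                String record;
                // poll() hands each record to exactly one worker, playing the
                // role of the synchronized record reader in Hadoop.
                while ((record = pending.poll()) != null) {
                    output.add(mapFn.apply(record));
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();  // Wait for every worker before returning.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return output;
    }
}
```

Note that with multiple threads the output order is no longer guaranteed to match the input order, which is one reason the wrapped Mapper (and anything it touches) must be thread-safe.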