- Currently, what is the default behavior of the run(Context) method?
The default implementation is visible in the Apache Hadoop source code for the Mapper class:
```java
/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
  } finally {
    cleanup(context);
  }
}
```
To summarize:
- Call `setup` for one-time initialization.
- Iterate through all key-value pairs in the input.
- Pass the key and value to the `map` method implementation.
- Call `cleanup` for one-time teardown.
- If I override run(Context), what kind of special control do I get, as per the documentation?
The default implementation always executes that same sequence in a single thread. Overriding it is rare, but it opens up possibilities for highly specialized implementations, such as different threading models or coalescing redundant key ranges before they reach `map`.
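As one illustration of that "special control", here is a hypothetical, simplified sketch (plain Java, not the real Hadoop API): a `Mapper`-like base class whose `run` method can be overridden to coalesce consecutive records with the same key, so `map` sees one summed record instead of several. The class and method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Analogous to Mapper: map() is called once per record by the default run().
class SimpleMapper {
    protected List<String> output = new ArrayList<>();

    // Analogous to Mapper.map(key, value, context).
    protected void map(String key, int value) {
        output.add(key + "=" + value);
    }

    // Analogous to the default Mapper.run(context): one map() call per record.
    public void run(Iterator<String[]> records) {
        while (records.hasNext()) {
            String[] kv = records.next();
            map(kv[0], Integer.parseInt(kv[1]));
        }
    }
}

// Override of run() that sums values across consecutive identical keys,
// so map() receives one coalesced record per key run.
class CoalescingMapper extends SimpleMapper {
    @Override
    public void run(Iterator<String[]> records) {
        String currentKey = null;
        int sum = 0;
        while (records.hasNext()) {
            String[] kv = records.next();
            if (currentKey != null && !currentKey.equals(kv[0])) {
                map(currentKey, sum);  // Flush the finished key run.
                sum = 0;
            }
            currentKey = kv[0];
            sum += Integer.parseInt(kv[1]);
        }
        if (currentKey != null) {
            map(currentKey, sum);  // Flush the final key run.
        }
    }
}
```

For input records `[a,1], [a,2], [b,5]`, the default `run` produces three `map` calls, while the override produces two (`a=3` and `b=5`). The same shape of change is possible with the real `run(Context)`, since the override fully controls the read loop.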
- Has anyone overridden this method in their implementations?
Within the Apache Hadoop codebase, there are two overrides of this method:

ChainMapper allows chaining together multiple Mapper class implementations for execution within a single map task. Its override of run sets up an object representing the chain, and passes each input key/value pair through that chain of mappers.
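The chaining idea can be sketched in plain Java (this is a hypothetical analogue, not the real ChainMapper classes): a driver whose `run`-style loop feeds each record through a list of map stages, where the output of one stage becomes the input of the next.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical analogue of the ChainMapper idea: compose map stages so that
// every record flows through all of them within one "map task".
class MapperChain {
    private final List<Function<String, String>> stages = new ArrayList<>();

    // Append one Mapper-like stage to the chain.
    public MapperChain add(Function<String, String> stage) {
        stages.add(stage);
        return this;
    }

    // Analogous to the overridden run(): pass each input value through
    // every stage in order, collecting the final outputs.
    public List<String> run(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String value : input) {
            for (Function<String, String> stage : stages) {
                value = stage.apply(value);
            }
            out.add(value);
        }
        return out;
    }
}
```

For example, a chain of an uppercasing stage followed by a suffixing stage turns `["a", "b"]` into `["A!", "B!"]`, mirroring how ChainMapper routes one mapper's output into the next mapper's input.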
MultithreadedMapper allows multi-threaded execution of another Mapper class, which must be thread-safe. Its override of run starts multiple threads that iterate the input key-value pairs and pass them through the underlying Mapper.
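The multi-threaded pattern can also be sketched in plain Java (a hypothetical analogue, not the real MultithreadedMapper): several worker threads pull records from a shared, thread-safe source and pass each one through the same thread-safe map function, much as the real override runs several threads over one synchronized record reader.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Function;

// Hypothetical analogue of the MultithreadedMapper idea.
class MultithreadedRunner {
    // Analogous to the overridden run(): start numThreads workers, each
    // iterating the shared input until it is exhausted.
    public static Queue<String> run(List<String> input,
                                    Function<String, String> mapFn,
                                    int numThreads) {
        Queue<String> pending = new ConcurrentLinkedQueue<>(input);
        Queue<String> output = new ConcurrentLinkedQueue<>();
        Thread[] workers = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            workers[i] = new Thread(() -> {
                String record;
                // poll() hands each record to exactly one worker, playing the
                // role of the synchronized record reader in Hadoop.
                while ((record = pending.poll()) != null) {
                    output.add(mapFn.apply(record));
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();  // Wait for every worker before returning.
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return output;
    }
}
```

Note that with multiple threads the output order is no longer guaranteed to match the input order, which is one reason the wrapped Mapper (and anything it touches) must be thread-safe.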