2
votes


shortly speaking I need a way to give Hadoop MapRedice API hint on what host I'd like to run certain reducer based on its partition. Is there any way?

Somewhat longer story:
I have few mapper tasks which generate (or import from another source) records for certain HBase table. Emitted records have ImmutableBytesWritable as keys. Number of reducers for this job exactly matches number of table regions and custom partitioner is used to distribute records so records of every region gets to appropriate reducer.

Reducers are intended to generate HFile images, one image per region so later bulk load could be used on them. The only serious problem here is I'd like reducers at least to 'try to run' on the same hosts appropriate region servers are running. This is to get good probability of generated HFiles locality (in terms of HDFS) for appropriate HBase region servers.

Any idea how to get this behavior?

Alternative could be how to 'request' HDFS file to 'get local'. Having this I could start another MR job with mappers bound to region servers (through splits) and request corresponding HFile to get local.

1

1 Answers

1
votes

There is no out-of-box way to do this yet, short of writing a custom scheduler, which would be an overkill.

An upstream ticket does track this feature request at https://issues.apache.org/jira/browse/MAPREDUCE-199.