I am running a MapReduce job to export data from HBase to HDFS. Multiple filters are applied to the scan.
It is not possible to limit the scan by row key, as the row key does not contain the required information.
When the MR job runs, one mapper is created for each region in the HBase table. Some of those regions contain only filtered-out data, so their mappers receive nothing to read and are terminated after the timeout interval. The volume of data to be extracted is significantly less than the total amount of data in the table, so the job eventually fails because of the large number of mappers being terminated.
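For context, the job is wired up roughly like this. This is a hedged sketch, not my exact code: the table name, column names, mapper class, and the specific filter are placeholders, and `CompareOperator` assumes HBase 2.x (1.x uses `CompareFilter.CompareOp`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class ExportJob {
    public static Job buildJob() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-export");

        Scan scan = new Scan();
        scan.setCaching(500);       // batch more rows per RPC for a sequential MR scan
        scan.setCacheBlocks(false); // avoid polluting the region server block cache

        // Several filters combined server-side; rows matching none are skipped,
        // which is why mappers over fully-filtered regions receive no input at all.
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("col"),   // placeholder family/qualifier
                CompareOperator.EQUAL, Bytes.toBytes("value")));
        scan.setFilter(filters);

        // One input split (and hence one mapper) is created per region.
        TableMapReduceUtil.initTableMapperJob(
                "source_table", scan,
                ExportMapper.class,  // hypothetical mapper writing rows out to HDFS
                ImmutableBytesWritable.class, Result.class,
                job);
        return job;
    }
}
```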
Answers I am not looking for:
- Implementing "manual" filtering within the mapper.
- Increasing the timeout interval.
What I am looking for is one of these:
- A link to an article describing how this problem is solved.
- An efficient solution or an idea (not necessarily with code) that does not involve running the full HBase table through mappers, or at least (let's be real) reduces the compute load on the mappers.
- A confirmation that there is no efficient way of doing this, as I've spent a fair amount of time looking for one.
I believe a code sample is not necessary, as anyone who understands HBase will know what I am asking for.
Thanks in advance.