I am running a MapReduce job to export data from HBase to HDFS. Multiple filters are applied to the scan.
It is not possible to limit the scan by row key, as the row key does not contain the required information.
When the MR job runs, one mapper is created for each region in the HBase table. Some of those regions contain only filtered-out data, so their mappers receive nothing to read and are terminated after the timeout interval. The volume of data to be extracted is significantly less than the total amount of data in the table, so the job eventually fails because of the large number of mappers being terminated.
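For context, the job is wired up roughly like this. This is a hedged sketch, not my exact code: the table name, column names, mapper class, and the specific filter are placeholders, and `CompareOperator` assumes HBase 2.x (1.x uses `CompareFilter.CompareOp`):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

public class ExportJob {
    public static Job buildJob() throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hbase-export");

        Scan scan = new Scan();
        scan.setCaching(500);       // batch more rows per RPC for a sequential MR scan
        scan.setCacheBlocks(false); // avoid polluting the region server block cache

        // Several filters combined server-side; rows matching none are skipped,
        // which is why mappers over fully-filtered regions receive no input at all.
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        filters.addFilter(new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("col"),   // placeholder family/qualifier
                CompareOperator.EQUAL, Bytes.toBytes("value")));
        scan.setFilter(filters);

        // One input split (and hence one mapper) is created per region.
        TableMapReduceUtil.initTableMapperJob(
                "source_table", scan,
                ExportMapper.class,  // hypothetical mapper writing rows out to HDFS
                ImmutableBytesWritable.class, Result.class,
                job);
        return job;
    }
}
```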
Answers I am not looking for:
- Implementing "manual" filtering within the mapper.
- Increasing the timeout interval.
What I am looking for is one of these:
- A link to an article describing how this problem is solved.
- An efficient solution or an idea (not necessarily with code) that does not involve running the full HBase table through mappers, or at least (let's be real) reduces the compute load on the mappers.
- A confirmation that there is no efficient way of doing this, as I've spent a fair amount of time looking for one.
I believe a code sample is not necessary, as anyone who understands HBase will know what I am asking for.
Thanks in advance.