10
votes

I need to write a MapReduce Job that Gets all rows in a given Date Range(say last one month). It would have been a cakewalk had My Row Key started with Date. But My frequent Hbase queries are on starting values of key.

My Row key is exactly A|B|C|20120121|D . Where combination of A/B/C along with date (in YearMonthDay format) makes a unique row ID.

My Hbase tables could have upto a few million rows. Should my Mapper read all the table and filter each row if it falls in given date range or Scan / Filter can help handling this situation?

Could someone suggest (or a snippet of code) a way to handle this situation in an effective manner?

Thanks -Panks

4
Why don't you copy the contents of the table to a new one with the key rearranged and scrap the old one?Mario
@Mario what if the table has a trillion keys? And he needs to do this often?markgiaconia

4 Answers

5
votes

You can use a RowFilter with a RegexStringComparator. You'd need to come up with a RegEx that filters your dates appropriately. This page has an example that includes setting a Filter for a MapReduce scanner.

11
votes

A RowFilter with a RegEx Filter would work, but would not be the most optimal solution. Alternatively you can try to use secondary indexes.

One more solution is to try the FuzzyRowFIlter. A FuzzyRowFilter uses a kind of fast-forwarding, hence skipping many rows in the overall scan process and will thus be faster than a RowFilter Scan. You can read more about it here.

Alternatively BloomFilters might also help depending on your schema. If your data is huge you should do a comparative analysis on secondary index and Bloom Filters.

0
votes

I am just getting started with HBase, bloom filters might help.

0
votes

You can modify the Scan that you send into the Mapper to include a filter. If your date is also the record timestamp, it's easy:

Scan scan = new Scan();
scan.setTimeRange(minTime, maxTime);
TableMapReduceUtil.initTableMapperJob("mytable", scan, MyTableMapper.class, 
     OutputKey.class, OutputValue.class, job);

If the date in your row key is different, you'll have to add a filter to your scan. This filter can operate on a column or a row key. I think it's going to be messy with just the row key. If you put the date in a column, you can make a FilterList where all conditions must be true and use a CompareOp.GREATER and a CompareOp.LESS. Then use scan.setFilter(filterList) to add your filters to the scan.