5 votes

Is there any Scan/Filter API with the following behavior?

Given a time range, I would like the scanner to include data from HFiles outside the range, but only for row keys that appear in HFiles within the range. The idea is to scan the in-memory indexes of all HFiles, but read data from disk only for row keys found in the in-range HFiles.

For example, if HFile1 is in range and HFile2 is out of range, and rowkey1 has any data in HFile1, I would like to get all of rowkey1's columns from HFile2 as well, as if it were in range. On the other hand, if rowkey2 appears in HFile2 but not in HFile1, the index scanner should simply skip to the next row key.

The use case is to load entire rows that were modified (even in just one column) during the last X hours, avoiding a full scan or any disk reads of redundant data. This will be integrated into Spark/MR applications, probably based on TableSnapshotInputFormat, so I could ship custom code for HRegion, HStore, or whatever else, if it comes to that.
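For context, this is roughly how I would wire a Scan into TableSnapshotInputFormat via TableMapReduceUtil; the snapshot name, restore directory, and mapper are placeholders, and the scan is where whatever time-range-aware behavior exists (or custom code) would plug in:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class SnapshotScanJob {

    // Placeholder mapper that just forwards the rows selected by the scan.
    static class RowMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(HBaseConfiguration.create(), "scan-recent-rows");
        job.setJarByClass(SnapshotScanJob.class);

        Scan scan = new Scan();
        // A time-range constraint or custom filter implementing the behavior
        // described above would be configured here.

        TableMapReduceUtil.initTableSnapshotMapperJob(
                "my_snapshot",                       // hypothetical snapshot name
                scan,
                RowMapper.class,
                ImmutableBytesWritable.class,
                Result.class,
                job,
                true,
                new Path("/tmp/snapshot-restore"));  // scratch dir for restoring the snapshot

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```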

Thanks a lot


1 Answer

2 votes

If this is the use case,

The use case is to load entire rows that were modified (even in just one column) during the last X hours, avoiding a full scan or any disk reads of redundant data

why wouldn't a Scan with a time range work? The HBase Java API org.apache.hadoop.hbase.client.Scan.setTimeRange(long, long) takes a time range as input and fetches only data that was modified within that range.
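A minimal sketch of that approach, assuming an HBase 1.x-style client, a hypothetical table name my_table, and a 6-hour window:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class TimeRangeScanExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        long now = System.currentTimeMillis();
        long xHoursAgo = now - 6L * 60 * 60 * 1000; // last 6 hours, adjust as needed

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            Scan scan = new Scan();
            // Only data with timestamps in [xHoursAgo, now) is returned.
            scan.setTimeRange(xHoursAgo, now);
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    // process result
                }
            }
        }
    }
}
```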

If you need more flexibility, apply a KeyOnlyFilter to the time-ranged scan to collect just the row keys, and then issue batch Gets for those rows (sized according to the row count) to retrieve them in full.
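A sketch of that two-pass approach, again assuming a hypothetical table my_table and a 6-hour window: the first pass finds which rows changed, the second pulls each of those rows completely, all columns, regardless of timestamp.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;

public class RecentRowsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        long now = System.currentTimeMillis();
        long xHoursAgo = now - 6L * 60 * 60 * 1000;

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("my_table"))) {

            // Pass 1: time-ranged scan that returns only row keys (no values).
            Scan scan = new Scan();
            scan.setTimeRange(xHoursAgo, now);
            scan.setFilter(new KeyOnlyFilter());

            List<Get> gets = new ArrayList<>();
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result result : scanner) {
                    gets.add(new Get(result.getRow()));
                }
            }

            // Pass 2: batch Get that fetches the full rows identified above.
            Result[] fullRows = table.get(gets);
            for (Result row : fullRows) {
                // process full row
            }
        }
    }
}
```

Depending on how many rows change per window, the list of Gets may need to be split into smaller batches rather than issued as a single table.get() call.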