Is there a Scan/Filter API with the following behavior?
Given a time range, I would like the scanner to also include data from out-of-range HFiles, but only for row keys that appear in HFiles which are in range. The idea is to scan the in-memory indexes of all HFiles, but read data from disk only for row keys found in HFiles that are in range.
For example, if HFile1 is in range and HFile2 is out of range, and rowkey1 has any data in HFile1, I would like to get all columns of rowkey1 from HFile2 as well, as if it were in range.
On the other hand, if rowkey2 is included in HFile2 but not in HFile1, the index scanner should just skip to the next row key.
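To make the intended semantics concrete, here is a minimal plain-Java model of the behavior described above. It is only a sketch and uses no HBase classes: FileModel, inRange, select and example are hypothetical names, each "file" is just a timestamp range plus an in-memory map, the range is assumed to be half-open, and cell versions are ignored (a later file in the list simply overwrites an earlier one).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Plain-Java model of the requested semantics; nothing here is an HBase API.
class RowSelectionSketch {

    // Hypothetical stand-in for an HFile: a timestamp range plus row -> (column -> value).
    static final class FileModel {
        final long minTs;
        final long maxTs;
        final Map<String, Map<String, String>> rows = new TreeMap<>();

        FileModel(long minTs, long maxTs) {
            this.minTs = minTs;
            this.maxTs = maxTs;
        }

        FileModel put(String row, String column, String value) {
            rows.computeIfAbsent(row, k -> new TreeMap<>()).put(column, value);
            return this;
        }

        // Assumption: the query range is half-open, [start, end).
        boolean inRange(long start, long end) {
            return minTs < end && maxTs >= start;
        }
    }

    static Map<String, Map<String, String>> select(List<FileModel> files, long start, long end) {
        // Pass 1: collect row keys that appear in at least one in-range file.
        Set<String> wanted = new TreeSet<>();
        for (FileModel f : files) {
            if (f.inRange(start, end)) {
                wanted.addAll(f.rows.keySet());
            }
        }
        // Pass 2: for those rows only, merge columns from every file, in range or not.
        Map<String, Map<String, String>> result = new TreeMap<>();
        for (FileModel f : files) {
            for (Map.Entry<String, Map<String, String>> e : f.rows.entrySet()) {
                if (wanted.contains(e.getKey())) {
                    result.computeIfAbsent(e.getKey(), k -> new TreeMap<>()).putAll(e.getValue());
                }
            }
        }
        return result;
    }

    // The HFile1/HFile2 example from the question, with made-up timestamps and values.
    static Map<String, Map<String, String>> example() {
        List<FileModel> files = new ArrayList<>();
        files.add(new FileModel(0, 50).put("rowkey1", "b", "old").put("rowkey2", "c", "old")); // HFile2
        files.add(new FileModel(100, 200).put("rowkey1", "a", "new"));                         // HFile1
        return select(files, 90, 300);
    }

    public static void main(String[] args) {
        System.out.println(example()); // prints {rowkey1={a=new, b=old}}
    }
}
```

rowkey1 comes back with its columns from both files, while rowkey2 is skipped because it only exists in the out-of-range file.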
The use case is to load entire rows that were modified (even in just one column) during the last X hours, while avoiding a full scan or any disk scan of redundant data. This is going to be integrated into Spark/MR applications, probably based on TableSnapshotInputFormat, so I guess I could ship some custom code for HRegion, HStore, or whatever else is needed, if it comes to that.
Thanks a lot