I have encountered this strange issue when running some test code on a Cloudera-based HBase deployment. Assume these are my row keys (a simplified version of my actual row key structure):
a_1
a_2
a_3
b_1
b_2
b_3
c_1
c_2
c_3
When I run a scan with start row b_2 and stop row c_2 (exclusive), I get the rows:
b_2
b_3
c_1
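
For reference, this is roughly how the plain range scan is set up (a minimal sketch; the connection, the table name `test_table`, and the `withStartRow`/`withStopRow` calls from the HBase 2.x API are my placeholders, on older versions the two-argument `Scan` constructor does the same):

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Range scan: start row is inclusive, stop row is exclusive by default.
Scan scan = new Scan()
        .withStartRow(Bytes.toBytes("b_2"))
        .withStopRow(Bytes.toBytes("c_2"));

// "connection" and "test_table" are placeholders for the actual connection and table.
try (Table table = connection.getTable(TableName.valueOf("test_table"));
     ResultScanner scanner = table.getScanner(scan)) {
    for (Result result : scanner) {
        System.out.println(Bytes.toString(result.getRow())); // prints b_2, b_3, c_1
    }
}
```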
When I add a FuzzyRowFilter for "?_2" while keeping the same start and stop rows, the scan seems to ignore the range and returns these rows:
a_2
b_2
c_2
whereas I would expect:
b_2
since a_2 and c_2 are out of my scan range.
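
The fuzzy scan is essentially the same, with a FuzzyRowFilter added on top of the same range (again a minimal sketch; in the mask byte array, 0 marks a fixed position and 1 marks a fuzzy one):

```java
import java.util.Arrays;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;

// Fuzzy key "?_2": the first byte may be anything (mask 1), "_2" must match exactly (mask 0).
FuzzyRowFilter fuzzyFilter = new FuzzyRowFilter(Arrays.asList(
        new Pair<>(Bytes.toBytes("?_2"), new byte[]{1, 0, 0})));

Scan fuzzyScan = new Scan()
        .withStartRow(Bytes.toBytes("b_2"))
        .withStopRow(Bytes.toBytes("c_2"))
        .setFilter(fuzzyFilter);
// On the cluster this returns a_2, b_2, c_2; on my local HBase 2.0.4 it returns only b_2.
```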
Now this is where it gets interesting: I installed a separate pseudo-distributed HBase 2.0.4 on my PC, and in that setup it works as expected! The only differences are the HBase version and the fact that my local installation does not run on a cluster.
So I am trying to find why this is happening, and I have a few questions:
- Am I wrong in my assumption that FuzzyRowFilter should respect the start-stop rows?
- Could it simply be a bug in the HBase version shipped with my cluster (Cloudera)?
- Could it be that FuzzyRowFilter started out as a full table scan and later versions evolved it to use the range? Note that I searched HBase Jira for a clue but could not find an issue about this. Neither could I find any unit test cases for FuzzyRowFilter that check correctness of the range; the existing test cases all use full Scan()s with no range.
- Could it be happening as a result of some cluster-deployment intricacy that I am not aware of? (I don't think so, but...)
Thanks.