7
votes

I have an HBase table whose row keys consist of a text ID and a timestamp, like this:

...
string_id1.1470913344067
string_id1.1470913345067
string_id2.1470913344067
string_id2.1470913345067
...

How can I filter an HBase Scan (in Scala or Java) to get only the rows with a given string ID and a timestamp greater than some value?

Thanks

3
What do you want to get? Give an example of what you want and what you have tried. - sarveshseri
@SarveshKumarSingh For example, if I have only the 4 keys from the question and I want only those with string_id2 and a timestamp greater than 1470913345000, the result should contain only the last key. - Vital Yeutukhovich
Can you give a detailed explanation of your need and what you have tried? Something this vague is not solvable. - sarveshseri

3 Answers

5
votes

The fuzzy row approach is efficient for this kind of requirement, especially when the data is huge. As explained in this article, FuzzyRowFilter takes a row key and mask info as parameters.

In the article's example, where we want to find the last logged-in users and the row key format is userId_actionId_timestamp (where userId has a fixed length of, say, 4 chars), the fuzzy row key we are looking for is ????_login_. This translates into the following parameters for FuzzyRowFilter:

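import java.util.Arrays;
import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Pair;
// Mask: 1 = any byte at this position, 0 = byte must match the fuzzy key.
FuzzyRowFilter rowFilter = new FuzzyRowFilter(
 Arrays.asList(
  new Pair<byte[], byte[]>(
    Bytes.toBytesBinary("\\x00\\x00\\x00\\x00_login_"),
    new byte[] {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0})));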
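
To make the mask semantics concrete, here is a minimal plain-Java sketch of the matching rule only. It does not use HBase: the class `FuzzyMaskSketch` and its helpers are hypothetical names made up for illustration, and treating a row key shorter than the fuzzy key as a non-match is my assumption, not something taken from the HBase source.

```java
import java.nio.charset.StandardCharsets;

public class FuzzyMaskSketch {
    // Returns true when every fixed position (mask byte 0) of the fuzzy key
    // equals the byte at the same position in the row key.
    // Positions with mask byte 1 are wildcards and are not compared.
    static boolean matches(byte[] rowKey, byte[] fuzzyKey, byte[] mask) {
        if (rowKey.length < fuzzyKey.length) {
            return false; // assumption: a shorter row key cannot match
        }
        for (int i = 0; i < fuzzyKey.length; i++) {
            if (mask[i] == 0 && rowKey[i] != fuzzyKey[i]) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: encode an ASCII string as bytes.
    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] fuzzyKey = ascii("????_login_");
        byte[] mask = {1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0};
        // The first four bytes are wildcards, "_login_" must match exactly.
        System.out.println(matches(ascii("ab12_login_1470913344067"), fuzzyKey, mask));
        System.out.println(matches(ascii("ab12_logout_1470913344067"), fuzzyKey, mask));
    }
}
```

Note that a mask like this only fixes or frees byte positions. It cannot express "timestamp greater than X" by itself, so for the question's row keys it would still need to be combined with a range condition.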

I would suggest going through HBase: The Definitive Guide, chapter "Client API: Advanced Features".

-2
votes

Let's say you somehow ended up with your lines in a monadic traversable structure like List or RDD. Now you want to keep only the strings with id = "string_id2" and timestamp > 1470913345000.

So what is the problem here? Just filter your traversable monadic structure on these two criteria.

val filtered = listOrRddOfLines
  // Split "id.timestamp" into (id, timestamp); assumes exactly one '.' per line.
  .map { l =>
    val Array(idStr, timestampStr) = l.split('.')
    (idStr, timestampStr.toLong)
  }
  .filter { case (idStr, timestamp) =>
    idStr == "string_id2" && timestamp > 1470913345000L
  }

-2
votes

I solved my problem by using two filters:
- PrefixFilter (I pass it the first part of the row key. In my case that is the string ID, for example "string_id1.")
- RowFilter (I pass it two parameters: first, CompareOp.GREATER_OR_EQUAL; second, the full row key with the required timestamp, for example "string_id1.1470913345000")

As a result I get all cells whose row key has the required string_id as its first part and a timestamp greater than or equal to the one I put in the second filter. That is exactly what I want.

Code snippet:

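import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{BinaryComparator, FilterList, PrefixFilter, RowFilter}
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp
import org.apache.hadoop.hbase.util.Bytes

val s = new Scan()
s.addFamily(Bytes.toBytes(family))
// The default FilterList operator is MUST_PASS_ALL, so a row has to pass both filters.
val filterList = new FilterList()
filterList.addFilter(new PrefixFilter(Bytes.toBytes(prefixOfRowKey)))
// Keep rows whose key is >= e.g. "string_id1.1470913345000" in byte order.
filterList.addFilter(new RowFilter(CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes(valueForBinaryFilter))))
s.setFilter(filterList)
val scanner = table.getScanner(s)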
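
The combination works because row keys are compared byte by byte, so within one string ID the keys sort by timestamp as long as all timestamps have the same number of digits. Here is a rough plain-Java sketch of that selection logic without HBase. The class `RowKeyFilterSketch` and its methods are hypothetical names used only to illustrate the idea, and the unsigned lexicographic comparison is written by hand rather than taken from the HBase API.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowKeyFilterSketch {
    // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // True when key starts with prefix (the role PrefixFilter plays in the answer).
    static boolean hasPrefix(byte[] key, byte[] prefix) {
        return key.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(key, prefix.length), prefix);
    }

    // Keeps keys that start with prefix AND are >= lowerBound in byte order.
    static List<String> select(List<String> rowKeys, String prefix, String lowerBound) {
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        byte[] lb = lowerBound.getBytes(StandardCharsets.UTF_8);
        List<String> result = new ArrayList<>();
        for (String k : rowKeys) {
            byte[] kb = k.getBytes(StandardCharsets.UTF_8);
            if (hasPrefix(kb, p) && compareBytes(kb, lb) >= 0) {
                result.add(k);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "string_id1.1470913344067", "string_id1.1470913345067",
            "string_id2.1470913344067", "string_id2.1470913345067");
        // Only the last key has the prefix and a timestamp >= 1470913345000.
        System.out.println(select(keys, "string_id2.", "string_id2.1470913345000"));
    }
}
```

If timestamps could have different lengths (for example 999 versus 1000), byte order would no longer equal numeric order, so this sketch assumes fixed-width timestamps like the ones in the question.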

Thanks to everyone who helped to find a solution.