1
votes

I'm quite new to HBase. Suppose we want to aggregate unique document counts per day for each category.

My first idea was something like this:

table name: yyyyMMdd
row key: category_docid
column family: whatever seems to be used afterwards

In that case, I think I can scan with a start row and a stop row derived from the row key prefix, then count the keys that come back.
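To make the start/stop idea concrete, here is a minimal in-memory sketch in Python. It is not the HBase client API: a sorted list of byte strings stands in for the table, and `prefix_stop_row` and `count_prefix` are hypothetical helper names. It assumes the `category_docid` row key layout and that the prefix includes the trailing `_`, so that `sports_` does not also match `sportswear_...`.

```python
from bisect import bisect_left

def prefix_stop_row(prefix: bytes) -> bytes:
    """Return the smallest key that sorts after every key starting with prefix."""
    stripped = prefix.rstrip(b"\xff")  # trailing 0xff bytes cannot be incremented
    if not stripped:
        raise ValueError("prefix has no upper bound")
    return stripped[:-1] + bytes([stripped[-1] + 1])

def count_prefix(sorted_keys, prefix: bytes) -> int:
    """Count keys in the half-open range [prefix, prefix_stop_row(prefix))."""
    start = bisect_left(sorted_keys, prefix)
    stop = bisect_left(sorted_keys, prefix_stop_row(prefix))
    return stop - start

# Hypothetical row keys for one day's table, kept in sorted (byte) order.
rows = sorted([b"news_d1", b"sports_d1", b"sports_d2", b"sportswear_d9", b"tech_d3"])
sports_count = count_prefix(rows, b"sports_")
```

Because the row key already contains the docid, each document appears at most once per category, so the number of keys in the range is the unique document count.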

But there are several problems:

1. A scan seems heavy for a count operation, since I have to go through the whole Result array and increment a counter myself.
2. Categories are continuously changing. It would be much better if something like 'group by' in SQL were possible, but I haven't found out how yet.

What do you think of this approach, or is there a better idea?
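For the 'group by' concern, one possibility is a single pass over the day's table that groups row keys by their category part on the client side, so the categories do not need to be known in advance. A minimal Python sketch follows, with a plain list standing in for the scan results (not the HBase API). `count_by_category` is a hypothetical name, and the sketch assumes the docid part contains no `_`.

```python
from collections import Counter

def count_by_category(row_keys):
    """Count row keys per category, assuming keys look like b'category_docid'."""
    counts = Counter()
    for key in row_keys:
        # Split at the last underscore so categories may contain '_' themselves.
        category, sep, _docid = key.rpartition(b"_")
        if sep:  # skip keys that do not follow the assumed layout
            counts[category.decode()] += 1
    return dict(counts)

# Hypothetical scan output for one day's table.
scanned = [b"news_d1", b"sports_d1", b"sports_d2", b"tech_d3"]
per_category = count_by_category(scanned)
```

This still reads every row of the day, so it addresses the changing categories but not the cost of the scan.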

2
You could add an additional key that maintains the count for the key pattern you are interested in. Then, instead of a count or scan operation, you can perform a single GET operation.
- Arun A K

2 Answers

1
votes

HBase doesn't provide real-time table counts; it has to perform a full table scan to count the rows, which is slow.

To get real-time counts you have to implement your own counters in your table and increment them when you insert new rows (or decrement them when you delete rows). HBase handles very high write rates well, and that's its strongest point. You can even have scoped counters (per hour, day, week, month, year...) by using multiple families/columns combined with a time-to-live for automatic pruning of old records. It's up to you how to implement it :)

See this working Java example from the HBase book source code.
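The counter idea can be sketched roughly as follows. This is an in-memory Python model, not HBase code: a set stands in for the document rows and a dict for the counter cells, and `DailyCounts` is a hypothetical name. Since the question asks for unique documents, the sketch assumes the counter should only be incremented when the (day, category, docid) row did not exist before. In a real table that check and the increment would need to be done atomically, which this model does not cover.

```python
class DailyCounts:
    """Toy model of per-day, per-category counters maintained at write time."""

    def __init__(self):
        self.rows = set()    # stands in for the document rows
        self.counters = {}   # stands in for the counter cells

    def add_document(self, day, category, docid):
        """Insert a document; increment the counter only if the row is new."""
        row = (day, category, docid)
        if row in self.rows:
            return False  # duplicate: the unique count must not change
        self.rows.add(row)
        key = (day, category)
        self.counters[key] = self.counters.get(key, 0) + 1
        return True

    def count(self, day, category):
        """Single lookup instead of a scan."""
        return self.counters.get((day, category), 0)

# Hypothetical sample data.
store = DailyCounts()
store.add_document("20240101", "sports", "d1")
store.add_document("20240101", "sports", "d1")  # same document again
store.add_document("20240101", "sports", "d2")
store.add_document("20240102", "sports", "d1")
```

Reading a count is then one lookup per (day, category) pair, at the price of an extra check and write on every insert.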

0
votes

Setting a time range on your Scan object, together with a row key prefix filter, should let you do this: the prefix restricts the scan to one category and the time range restricts it to the day you are interested in.
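A rough model of what that combination selects, in plain Python rather than the HBase API: cells are (row key, timestamp) pairs, and a row is counted if it matches the prefix and has at least one cell in the time range. The function name `count_prefix_in_range` and the sample timestamps are made up for illustration, and the sketch assumes a half-open range (minimum inclusive, maximum exclusive) and a single table holding all days rather than one table per day.

```python
def count_prefix_in_range(cells, prefix, min_ts, max_ts):
    """Count distinct row keys with the prefix and a cell in [min_ts, max_ts)."""
    matching = {
        key
        for key, ts in cells
        if key.startswith(prefix) and min_ts <= ts < max_ts
    }
    return len(matching)

# Hypothetical cells: (row key, write timestamp).
cells = [
    (b"sports_d1", 100),
    (b"sports_d1", 150),  # second version of the same row
    (b"news_d1", 120),
    (b"sports_d3", 200),
    (b"sports_d2", 250),
]
in_window = count_prefix_in_range(cells, b"sports_", 100, 200)
```

Note that this is still a scan, only a narrower one, so the cost grows with the number of matching rows.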