HBase row key design for reads and updates

Question

I'm trying to understand the best way to design the row key for my HBase table.

My use case:

Structure right now

PersonID | BatchDate | PersonJSON

When something about a person is modified, a new PersonJSON and a new BatchDate are inserted into HBase, overwriting the old record. Every 4 hours, a scan picks up all the people who have been modified and pushes them to Hadoop for further processing.

If my key is just PersonID, it's great for updating the data. But my performance sucks because I have to add a filter on the BatchDate column to scan all the rows greater than a given batch date.

If my key is a composite key like BatchDate|PersonID, I could use startrow and endrow on the row key and get all the rows that have been modified. But then I would have a lot of duplicates, since the same person would appear under a different key for every batch, and I could no longer update a person in place.

Is a bloom filter on row+col (PersonID+BatchDate) an option?

Any help is appreciated. Thanks, Abhishek

Mark Rajcok · Accepted Answer · 2015-01-12T22:49:35

In addition to the table with PersonID as the rowkey, it sounds like you need a dual-write secondary index, with BatchDate as the rowkey.
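To make the dual-write idea concrete, here is a minimal sketch that simulates it in memory rather than against a real cluster. Everything in it is an assumption for illustration: `SortedTable` is a hypothetical stand-in for an HBase table (rows kept sorted by key, scanned by start/stop row), and the names `people`, `by_batch`, `upsert_person` and `modified_between` are made up. It also assumes BatchDate is stored as a fixed-width, sortable string, so that lexicographic key order matches time order.

```python
from bisect import bisect_left, insort


class SortedTable:
    """Hypothetical in-memory stand-in for an HBase table (rows sorted by key)."""

    def __init__(self):
        self._keys = []  # row keys, kept in sorted order
        self._rows = {}  # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def delete(self, key):
        if key in self._rows:
            del self._rows[key]
            self._keys.pop(bisect_left(self._keys, key))

    def scan(self, start_row, stop_row):
        """Yield (key, value) for start_row <= key < stop_row."""
        lo = bisect_left(self._keys, start_row)
        hi = bisect_left(self._keys, stop_row)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


people = SortedTable()    # main table, row key: PersonID
by_batch = SortedTable()  # index table, row key: BatchDate|PersonID


def index_key(batch_date, person_id):
    return f"{batch_date}|{person_id}"


def upsert_person(person_id, batch_date, person_json):
    """Dual write: update the main row and keep the index table in step."""
    old = people.get(person_id)
    if old is not None:
        # Drop the stale index entry so each person is indexed only once.
        by_batch.delete(index_key(old["batch_date"], person_id))
    people.put(person_id, {"batch_date": batch_date, "json": person_json})
    by_batch.put(index_key(batch_date, person_id), person_id)


def modified_between(start_batch, stop_batch):
    """Range-scan the index, then fetch the current JSON from the main table."""
    return [
        (person_id, people.get(person_id)["json"])
        for _, person_id in by_batch.scan(start_batch, stop_batch)
    ]


upsert_person("p1", "2015-01-12T04:00", '{"name": "A"}')
upsert_person("p2", "2015-01-12T08:00", '{"name": "B"}')
upsert_person("p1", "2015-01-12T09:00", '{"name": "A2"}')
changed = modified_between("2015-01-12T08:00", "2015-01-12T12:00")
```

Updates still go through PersonID, while the 4-hourly job only touches the slice of the index between two batch dates. Note that in a real deployment the two writes go to separate tables and are not atomic together, so the job should tolerate an occasional stale or missing index entry, for example by checking the BatchDate on the main row before using it.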

Another option would be Apache Phoenix, which provides support for secondary indexes.
