1
votes

I am looking for some help on HBase (fairly new to it and trying to understand if I cna use it for my POC).

Use case: I need a historical price data table which for e.g. will store data for say 10 different indices. One of the requirement would be to trace or audit trail the changes made to any attribute of a constituents or shares or instrument. Also if I want to find the list of instruments which has variance of price change n% in the month of say Jan 2010.

Data e.g. (some possibilities) (columns mentioned below are just to illustrate)

    date instrument high low mid user ts
    20130101 goog 34 33.4 33.8 system 10:30
    20130101 yhoo 24 23.4 23.8 system 10:50
    20130101 goog 34.1 33.3 33.8 ops 10:55
    20130101 msft 134 133.4 133.8 system 11:00
    20130101 msft 134 133.9 133.8 ops 11:30
    20130101 goog 34.1 33.3 34.1 ops 11:30
    20130101 aapl 48 48.4 47.9 system 11:30

Similar data will be availabe for subsequent dates. Kindly note that in a day a instrument's attribute/attributes value could change by any user (as seen for goog, msft) and for some no change at all (aapl, yhoo).

What would be the best data model which I can use to store this data and from which retrieval would also be easy?

If HBase has composite rowkey (please help me with syntax in case it is) then I can have something like,

    ROW                          COLUMN+CELL                        
    goog-20130101                 column=cf1:h1, timestamp=1389020633920, value=34
    goog-20130101                 column=cf1:h2, timestamp=1389020654614, value=34.1
    goog-20130101                 column=cf1:h3, timestamp=1389020668338, value=34.1
    goog-20130101                 column=cf1:l1, timestamp=1389020633920, value=33.4
    goog-20130101                 column=cf1:l2, timestamp=1389020654614, value=33.8
    goog-20130101                 column=cf1:l3, timestamp=1389020668338, value=33.3
    goog-20130101                 column=cf1:u1, timestamp=1389020633920, value=system
    goog-20130101                 column=cf1:u2, timestamp=1389020654614, value=ops
    goog-20130101                 column=cf1:u3, timestamp=1389020668338, value=ops

    aapl-20130101                 column=cf1:h1, timestamp=1389020633920, value=48
    aapl-20130101                 column=cf1:l1, timestamp=1389020633920, value=48.4
    aapl-20130101                 column=cf1:u1, timestamp=1389020633920, value=system

1) Can we create such rowkeys? How? 2) If the data for a rowkey already exists (goog-20130101) for e.g. then how can we inform/put the data to the same rowkey BUT column name is changed to h1, l1, u1 in our case? subsequently to h2, l2 etc. Is this acheivable? 3) How to retrieve the latest data and its values (say hi for goog on a date)?

Or if someone has come across such data (where you track multiple events/activity of user/object anything for a day and store), can advice on a better model for this which suits HBase.

Thanks in advance for your help.

2

2 Answers

0
votes

One aspect of HBase you may not have yet completely assimilated is its automatic creation of and maintenance of multiple versions of a cell. A cell is a {row, column, version} tuple in HBase.

HBase retains three versions of the same cell by default - and it can be configured to store any number of versions. The max number is set at time of table creation. Also see the HColumnDescriptor information

HBase versioning info from the HBase book

Therefore you may have more flexibility in your row key selection.

0
votes

Rowkeys in HBase are arbitrary byte arrays, so you can make them however you would like. You can call Bytes.toBytes() and pass it various types to convert the rowkeys to byte arrays suitable in hbase.

There is no native way in hbase to increment the column names like you want to do. You would need some external source of data to maintain the counts and your application code would need to read what the current version is and write with a new column name in the Get request. This could be a column in hbase that you increment when you update 'h' 'l' and 'u'. Writing new columns isn't a problem, as long as it is on the order of millions of unique columns. Another technique you could consider is including a timestamp in the qualifier so they can be unique without needing an extra read to keep up with version changes.

To determine the latest changes, you can either read the whole row and process it yourself or you can use a filter, like: ColumnPrefixFilter to retrieve data between a specific range.

There is a project OpenTSDB which uses the notion of rowkeys bucketed by time. It is generally used for infrastructure monitoring, but the code is open source so you may be able to find some useful techniques there.