HBase shell - Model and retrieve data for historical data and audit information

Question

I am looking for some help with HBase (I am fairly new to it and trying to understand whether I can use it for my POC).

Use case: I need a historical price data table that will store data for, say, 10 different indices. One requirement is to keep an audit trail of the changes made to any attribute of a constituent, share, or instrument. I also want to be able to find the list of instruments whose price varied by n% in a given month, say Jan 2010.

Example data (some possibilities; the columns below are just for illustration):

    date instrument high low mid user ts
    20130101 goog 34 33.4 33.8 system 10:30
    20130101 yhoo 24 23.4 23.8 system 10:50
    20130101 goog 34.1 33.3 33.8 ops 10:55
    20130101 msft 134 133.4 133.8 system 11:00
    20130101 msft 134 133.9 133.8 ops 11:30
    20130101 goog 34.1 33.3 34.1 ops 11:30
    20130101 aapl 48 48.4 47.9 system 11:30

Similar data will be available for subsequent dates. Note that within a day, one or more attributes of an instrument can be changed by any user (as seen for goog and msft), while other instruments see no change at all (aapl, yhoo).

What would be the best data model which I can use to store this data and from which retrieval would also be easy?

If HBase supports composite rowkeys (please help me with the syntax if it does), then I could have something like:

    ROW                          COLUMN+CELL                        
    goog-20130101                 column=cf1:h1, timestamp=1389020633920, value=34
    goog-20130101                 column=cf1:h2, timestamp=1389020654614, value=34.1
    goog-20130101                 column=cf1:h3, timestamp=1389020668338, value=34.1
    goog-20130101                 column=cf1:l1, timestamp=1389020633920, value=33.4
    goog-20130101                 column=cf1:l2, timestamp=1389020654614, value=33.8
    goog-20130101                 column=cf1:l3, timestamp=1389020668338, value=33.3
    goog-20130101                 column=cf1:u1, timestamp=1389020633920, value=system
    goog-20130101                 column=cf1:u2, timestamp=1389020654614, value=ops
    goog-20130101                 column=cf1:u3, timestamp=1389020668338, value=ops

    aapl-20130101                 column=cf1:h1, timestamp=1389020633920, value=48
    aapl-20130101                 column=cf1:l1, timestamp=1389020633920, value=48.4
    aapl-20130101                 column=cf1:u1, timestamp=1389020633920, value=system

1) Can we create such rowkeys? How?
2) If data already exists for a rowkey (e.g. goog-20130101), how can we put new data under the same rowkey but with a changed column name (h1, l1, u1 in our case, then h2, l2, and so on)? Is this achievable?
3) How do we retrieve the latest data and its values (say, the high for goog on a given date)?

Or, if someone has come across such data (where you track multiple events or activities of a user or object over a day and store them), can you advise on a better model that suits HBase?

Thanks in advance for your help.

WestCoastProjects · Accepted Answer · 2014-01-07T17:22:34

One aspect of HBase you may not have fully assimilated yet is that it automatically creates and maintains multiple versions of a cell. A cell is a {row, column, version} tuple in HBase.

HBase releases before 0.96 retain three versions of the same cell by default (0.96 and later default to one), and a column family can be configured to keep any number of versions. The maximum is set per column family, typically at table creation time. Also see the HColumnDescriptor information.

HBase versioning info from the HBase book

Therefore you may have more flexibility in your row key selection.
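To make the versioning idea concrete, here is a minimal Python sketch of how versioned cells behave. It is a toy in-memory model, not the HBase API: the `VersionedTable` class, the `row_key` helper, the `cf1:high` / `cf1:user` column names, and the integer timestamps are all assumptions made for illustration. The point is that the column name can stay fixed (no need for h1, h2, h3), because each put to the same row and column adds a new version.

```python
class VersionedTable:
    """Toy in-memory model of HBase cell versioning (NOT the HBase API)."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row key, column) -> list of (timestamp, value), newest first
        self._cells = {}

    def put(self, row, column, value, ts):
        versions = self._cells.setdefault((row, column), [])
        # A put with an existing timestamp replaces that version.
        versions[:] = [(t, v) for t, v in versions if t != ts]
        versions.append((ts, value))
        # Keep newest first and drop anything beyond max_versions.
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return self._cells.get((row, column), [])[:versions]


def row_key(instrument, date):
    # Hypothetical composite key: a row key is just bytes, so concatenating works.
    return f"{instrument}-{date}"


table = VersionedTable(max_versions=10)
key = row_key("goog", "20130101")

# The three goog updates from the question; timestamps are illustrative (HHMM).
for ts, high, user in [(1030, "34", "system"), (1055, "34.1", "ops"), (1130, "34.1", "ops")]:
    table.put(key, "cf1:high", high, ts)
    table.put(key, "cf1:user", user, ts)

latest_high = table.get(key, "cf1:high")               # newest version only
audit_trail = table.get(key, "cf1:user", versions=10)  # full change history
```

In the real shell, the corresponding knobs are the `VERSIONS` option on the column family when you `create` the table and the `VERSIONS` option on `get`. Note that the retained history is capped by the configured maximum, so size it for the number of intraday changes you need to audit.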
