We are planning to use HBase in one of our projects.

We receive browsing data from our internal systems; the data format is shown below.

Our requirement is to support three different types of search:

  1. Destination IP (D IP) + date range (start date and end date)
  2. Source IP (S IP) + date range (start date and end date)
  3. URL + date range (start date and end date)

I am thinking of creating three HBase tables with the following row keys:

  1. Row key as DestinationIP + DateTime
  2. Row key as SourceIP + DateTime
  3. Row key as URL + DateTime
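
For illustration, here is a minimal sketch of how an IP + DateTime row key and the matching scan range could be built. It assumes IPv4 addresses packed into 4 bytes and the DateTime kept as its 14-character string so that keys sort chronologically; the function names are hypothetical, and no HBase client is involved (the returned bytes would be passed to whatever client is used as the scan's start and stop rows).

```python
import ipaddress
from datetime import datetime

TS_FORMAT = "%Y%m%d%H%M%S"  # matches the DateTime column, e.g. 20140421093842


def ip_to_bytes(ip: str) -> bytes:
    """Pack a dotted-quad IPv4 address into 4 fixed-width bytes."""
    return ipaddress.IPv4Address(ip).packed


def ts_to_bytes(ts: str) -> bytes:
    """Validate a DateTime string and return it as 14 ASCII bytes."""
    datetime.strptime(ts, TS_FORMAT)  # raises ValueError on malformed input
    return ts.encode("ascii")


def ip_row_key(ip: str, ts: str) -> bytes:
    """Row key = 4-byte IP followed by the 14-byte DateTime."""
    return ip_to_bytes(ip) + ts_to_bytes(ts)


def ip_scan_range(ip: str, start_ts: str, end_ts: str):
    """Return (start_row, stop_row) covering one IP and an inclusive date range.

    HBase scans treat the stop row as exclusive by default, so a zero byte
    is appended to the end key to make end_ts itself fall inside the range.
    """
    start_row = ip_row_key(ip, start_ts)
    stop_row = ip_row_key(ip, end_ts) + b"\x00"
    return start_row, stop_row


start_row, stop_row = ip_scan_range(
    "176.204.134.111", "20140421000000", "20140421235959"
)
```

Because the IP prefix is fixed width, all rows for one IP are contiguous and ordered by time, so each of the first two searches becomes a single range scan. Note that a key of exactly IP + DateTime is not unique when the same IP produces several requests in one second; a real key would probably need an extra suffix, which this sketch leaves out.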

If I go with the above approach, it will cost us a lot of space, because every record is stored three times.

S IP            DateTime       Method URL        - ResponseCode - D IP -
176.204.134.111 20140421093842 GET    http://googleads.g.doubleclick.net/pagead/adview?ai=CAbmt4K5UU47XB5GS8wPOi4C4CKH1-ZwCkbiU7inAjbcBEAEgptSKH1D0-ev7B2CRdsgBAakC4V3k_lZFkj6oAwHIA4oEqgSQAU_QtfygurroekV-h5dYCoVP70qKDV1sAkiI60NNZiQ1wICQkqb5XMC3TllLKrhD0KxX0kb9-LnGkCDTqGmDE3Do-UdLGIyluqQ7MwoAcuTJMUajYKOflKPd2ZDj6RlKUAI9pbdkb96-k-XTVpON9rjUM2vUkvjwW3BwSfQk656GjoyUcEwsjwWId7p7obHcTsAEqf_DzQKSBQQIBBgBkgUECAUYBJAGAdgGAoAHueeCC5gHAQ&sigh=7zrG0DRVvMA 0 TCP_MISS/200 - 173.194.66.155 -  0
2.50.165.129    20140421093842 GET    http://www.alquds.co.uk/wp-content/uploads/2014/04/1217.jpg 0 TCP_MISS/200 - 46.165.251.78 -  0

What is a good schema design for these requirements?

1 Answer


Consider using OpenTSDB, which is optimized for the storage of small key-value time series data.

Even if you don't choose to use it, definitely read this slide deck discussing the schema design decisions that went into it.
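
As a rough idea of what that schema looks like: OpenTSDB keeps row keys short by using a fixed-width identifier followed by a timestamp rounded down to the hour, and stores the offset within the hour in the column qualifier. The sketch below adapts that idea to the log records in the question. It is a simplification under stated assumptions: the 4-byte packed IP stands in for OpenTSDB's UID, the qualifier is a plain 2-byte offset without OpenTSDB's flag bits, the DateTime values are assumed to be UTC, and all function names are hypothetical.

```python
import struct
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # one row per identifier per hour


def parse_ts(ts: str) -> int:
    """Convert a DateTime like 20140421093842 to epoch seconds (UTC assumed)."""
    dt = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp())


def split_timestamp(epoch_seconds: int):
    """Split a timestamp into (hour-aligned base, offset within the hour)."""
    offset = epoch_seconds % BUCKET_SECONDS
    return epoch_seconds - offset, offset


def bucketed_cell(uid: bytes, epoch_seconds: int):
    """Return (row_key, column_qualifier) for one record.

    row_key   = fixed-width identifier + 4-byte big-endian base timestamp
    qualifier = 2-byte big-endian offset in seconds (0..3599)
    """
    base, offset = split_timestamp(epoch_seconds)
    row_key = uid + struct.pack(">I", base)
    qualifier = struct.pack(">H", offset)
    return row_key, qualifier


# Hypothetical identifier: the source IP 176.204.134.111 packed into 4 bytes.
uid = bytes([176, 204, 134, 111])
row_key, qualifier = bucketed_cell(uid, parse_ts("20140421093842"))
```

With this layout, all requests from one IP within the same hour share one 8-byte row key and differ only in the column qualifier, so a date-range search is still a contiguous range scan over the hourly rows.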