0
votes

I am moving from mysql to hbase due to increasing data.

I am designing rowkey for efficient access pattern.

I want to achieve 3 goals.

  1. Get all results of email address
  2. Get all results of email address + item_type
  3. Get all results of particular email address + item_id

I have 4 attributes to choose from

  1. user email
  2. reverse timestamp
  3. item_type
  4. item_id

What should my rowkey look like to get rows efficiently?

Thanks

2
Could you provide the equivalent SQL fulfill your goals? - ericson
select * from table where email = "xyz" and item_type=2 order by ts desc...I have unique key index on (email, item_type ,item_id) - Amit Patil
Are email and item_type mandatory in where statement? Do you have statements like select * from table where email = "xyz" or select * from table where item_type=2 ? And you have mentioned you want to "Get all results of particular item_id" in your post. I don't see it from your SQL. - ericson
sorry...email is always there in any query - Amit Patil

2 Answers

1
votes

Assuming your main access is by email you can have your main table key as email + reverse time + item_id (assuming item_id gives you uniqueness)

You can have an additional "index" table with email+item_type+reverse time+item_id and email+item_id as keys that maps to the first table (so retrieving by these is a two step process)

0
votes

Maybe you are already headed in the right direction as far as concatenated row keys: in any case following comes to mind from your post:

Partitioning key likely consists of your reverse timestamp plus the most frequently queried natural key - would that be the email? Let us suppose so: then choose to make the prefix based on which of the two (reverse timestamp vs email) provides most balanced / non-skewed distribution of your data. That makes your region servers happier.

Choose based on better balanced distribution of records: reverse timestamp plus most frequently queried natural key e.g. reversetimestamp-email or email-reversetimestamp

In that manner you will avoid hot spotting on your region servers. .

To obtain good performance on the additional (secondary ) indexes, that is not "baked into" hbase yet: they have a design doc for it (look under SecondaryIndexing in the wiki).

But you can build your own a couple of ways:

a) use coprocessor to write the item_type as rowkey to separate tabole with a column containing the original (user_email-reverse timestamp (or vice-versa) fact table rowke

b) if disk space not issue and/or the rows are small, just go ahead and duplicate the entire row in the second (and third for the item-id case) tables.