1
votes

I decided to use HBase in a project to store users' activities in a social network. Although HBase has a simple way to express data (column oriented), I'm having some difficulty deciding how to represent the data.

So, imagine that you have millions of users, and each user generates an activity when they, for example, comment in a thread, publish something, like, vote, etc. I have thought of basically two approaches for an Activity HBase table:

  1. The key could be the user reference + timestamp of activity creation, and the value all the activity metadata (most of the time fixed size).

  2. The key is the user reference, and then each activity would be stored as a new column inside a column family.

I have seen examples for other types of systems (such as blogs) that use the 2nd approach. The first approach (with fixed columns, varying only when you change the schema) is more commonly seen.

What would be the impact on the way I access the data with these 2 approaches?

1
After some searching, I found the case of Meetup, which includes some material on how to model activities: slideshare.net/ghelmling/hbase-at-meetup - Cipriani
I also found the chapter of the HBase book to be published by O'Reilly that discusses data modeling strategies: ofps.oreilly.com/titles/9781449396107/advanced.html - Cipriani

1 Answer

2
votes

In general, you are asking whether your table should be wide or tall. HBase works with both, up to a point. A row in a wide table should never exceed the region size (256MB by default), so a really prolific user may crash the system if you store large chunks of data for their actions. If you are only storing a few bytes per action, putting all of a user's activity in one row lets you get their full history with a single get. The trade-off is that you retrieve the full row, which can be slow for users with a lot of history (tens of seconds for rows over 100MB).
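To make the access difference concrete, here is a toy in-memory model of the two layouts. It is only a sketch and does not use HBase at all: the dictionaries, the `user|timestamp` key format, and the function names are made up for illustration.

```python
from bisect import bisect_left

# Wide layout: one row per user, one column per activity.
# Fetching a user means fetching the whole row.
wide = {
    "user1": {"t1": "post", "t2": "comment", "t3": "vote"},
    "user2": {"t1": "like"},
}

# Tall layout: one row per activity, kept sorted by row key
# (HBase also stores rows sorted by key).
tall = sorted([
    ("user1|t1", "post"),
    ("user1|t2", "comment"),
    ("user1|t3", "vote"),
    ("user2|t1", "like"),
])


def get_wide(user):
    """Single lookup, but every activity of the user comes back."""
    return wide[user]


def scan_tall(user, limit):
    """Seek to the first key of the user, then read at most `limit` rows."""
    prefix = user + "|"
    keys = [key for key, _ in tall]
    i = bisect_left(keys, prefix)
    rows = []
    while i < len(tall) and tall[i][0].startswith(prefix) and len(rows) < limit:
        rows.append(tall[i])
        i += 1
    return rows
```

With the wide layout the amount of data read grows with the user's full history, while the tall layout lets the reader stop after `limit` rows.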

Going with a tall table and a reverse timestamp (for example, Long.MAX_VALUE minus the timestamp, so that newer rows sort first) would let you get a user's most recent activity very quickly (start a scan with the key = user id).
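A minimal sketch of how such a row key could be laid out, in plain Python with no HBase client involved. It assumes numeric user ids and millisecond timestamps encoded as fixed-width big-endian integers; the function names are hypothetical.

```python
import struct

MAX_LONG = 2**63 - 1  # same value as Java's Long.MAX_VALUE


def activity_row_key(user_id, ts_millis):
    """Row key = 8-byte user id + 8-byte reverse timestamp (assumed layout)."""
    reverse_ts = MAX_LONG - ts_millis  # newer activity -> smaller value
    # Big-endian fixed-width integers keep byte order equal to numeric order.
    return struct.pack(">QQ", user_id, reverse_ts)


def user_scan_range(user_id):
    """Start (inclusive) and stop (exclusive) keys covering one user's rows.

    Assumes user_id is below the maximum 64-bit value so user_id + 1 fits.
    """
    return struct.pack(">Q", user_id), struct.pack(">Q", user_id + 1)
```

Because HBase sorts rows by the raw bytes of the key, a scan that starts at the user's prefix and is limited to N rows would return that user's N most recent activities first under this layout.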

Using a timestamp at the end of the key is a good idea if you want to query by time, but a bad idea if you want to optimize writes to your database (writes will always go to the most recent region in the system, causing hotspotting).

You might also want to consider putting more information (such as the activity type) in the key so that you can pick up all activity of a particular type more easily.
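As a rough illustration of that idea, the key could carry a one-byte activity code between the user id and the reverse timestamp. The codes, the layout, and the function names below are assumptions made for the example, not anything HBase prescribes.

```python
import struct

MAX_LONG = 2**63 - 1  # same value as Java's Long.MAX_VALUE

# Hypothetical one-byte codes for the activity types named in the question.
ACTIVITY_CODES = {"comment": 1, "publish": 2, "like": 3, "vote": 4}


def typed_row_key(user_id, activity, ts_millis):
    """Row key = user id (8 bytes) + activity code (1 byte) + reverse timestamp (8 bytes)."""
    code = ACTIVITY_CODES[activity]
    return struct.pack(">QBQ", user_id, code, MAX_LONG - ts_millis)


def typed_prefix(user_id, activity):
    """Key prefix shared by all rows of one user and one activity type."""
    return struct.pack(">QB", user_id, ACTIVITY_CODES[activity])
```

A scan bounded by `typed_prefix(user, "vote")` would then touch only that user's votes, newest first. The cost is that reading all of a user's activity in time order now needs one scan per type, or a merge on the client.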

Another example to look at is OpenTSDB.