12 votes

I'm working on a project that stores key/value information about users in HBase. We are in the process of redesigning the HBase schema we are using. The two options being discussed are:

  1. Use HBase column qualifiers as names for the keys. This would make rows wide, but very sparse.
  2. Dump all the data into a single column and serialize it using Avro or Thrift.
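
For concreteness, here is roughly what the two write paths look like with the standard HBase Java client. This is just a sketch: the table, column family, qualifier, and row key names are invented, and option 2 uses a placeholder where the Avro/Thrift-serialized bytes would go.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoSchemas {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Option 1: one column qualifier per key -> wide, sparse rows.
            Put perKey = new Put(Bytes.toBytes("user-42"));
            perKey.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("city"), Bytes.toBytes("Austin"));
            perKey.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("zip"), Bytes.toBytes("78701"));
            table.put(perKey);

            // Option 2: one opaque cell holding the whole record.
            // Placeholder bytes; in real code this would be the Avro/Thrift output.
            byte[] serialized = Bytes.toBytes("<serialized-record>");
            Put blob = new Put(Bytes.toBytes("user-42"));
            blob.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("profile"), serialized);
            table.put(blob);
        }
    }
}
```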

What are the design tradeoffs of the two approaches? Is one preferable to the other? Are there any reasons not to store the data using Avro or Thrift?


2 Answers

12 votes

In summary, I lean towards using distinct columns per key.

1) Obviously, you are imposing a requirement that every client use Avro/Thrift, which is another dependency. That dependency can rule out certain tooling, such as BI tools that expect to find plain values in the data without transformation.

2) Under the Avro/Thrift scheme, you are pretty much forced to bring the entire value across the wire. Depending on how much data is in a row, this may not matter. But if you are only interested in the 'city' field/column qualifier, you still have to fetch 'payments', 'credit-card-info', etc. This may also pose a security issue.
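
With distinct columns, by contrast, you can ask HBase for exactly the one qualifier you need, and the other cells never leave the region server. A minimal sketch with the standard Java client; the table, family, and row key names are invented:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetOneColumn {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Ask for just the 'city' qualifier; 'payments' etc. stay server-side.
            Get get = new Get(Bytes.toBytes("user-42"));
            get.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("city"));
            Result result = table.get(get);
            String city = Bytes.toString(
                result.getValue(Bytes.toBytes("attrs"), Bytes.toBytes("city")));
            System.out.println(city);
        }
    }
}
```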

3) Updates, if required, will be more challenging with Avro/Thrift. Example: you decide to add a 'hasIphone6' key. With Avro/Thrift, you are forced to read the whole serialized value, add the field, and write the whole thing back. Under the column scheme, a new cell is simply appended, containing only the new column. For a single row this is no big deal, but do it to a billion rows and you have rewritten a lot of data and set up a big compaction operation.
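
Under the column scheme, the "update" is just a single one-cell Put (a sketch; names are invented, as above):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AddNewKey {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Appends one new cell; the row's existing cells are untouched.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("attrs"),
                          Bytes.toBytes("hasIphone6"),
                          Bytes.toBytes("true"));
            table.put(put);
        }
    }
}
```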

4) If configured, you can use compression in HBase, which may beat Avro/Thrift serialization on size, since HBase can compress across a whole column family instead of just a single record.
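
As a sketch, enabling it per column family with the HBase 2.x admin API might look like this (assumes the Snappy native libraries are available on the region servers; table and family names are invented):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableCompression {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Admin admin = conn.getAdmin()) {
            // Compression is set per column family and applies across all its cells.
            ColumnFamilyDescriptor attrs = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("attrs"))
                .setCompressionType(Compression.Algorithm.SNAPPY)
                .build();
            admin.modifyColumnFamily(TableName.valueOf("users"), attrs);
        }
    }
}
```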

5) BigTable implementations like HBase do very well with very wide, sparse tables, so there won't be the performance hit you might expect.

5 votes

The right answer to this is a bit more complicated, so I'll give you the tl;dr first.

Use Avro/Thrift/Protobuf

You will need to strike a balance between how many fields you pack into a record and how many you keep as separate columns.

You'll typically want to put fields ("keys" in your original question) that are frequently accessed together into something like an Avro record, because, as cmonkey mentioned, you don't want the overhead of retrieving extra data you won't use.
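
For instance, packing two co-accessed fields into one Avro record and getting bytes you can store in a single HBase cell might look like this (a sketch; the schema and field names are invented):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class PackRecord {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Profile\",\"fields\":["
          + "{\"name\":\"city\",\"type\":\"string\"},"
          + "{\"name\":\"zip\",\"type\":\"string\"}]}");

        GenericRecord profile = new GenericData.Record(schema);
        profile.put("city", "Austin");
        profile.put("zip", "78701");

        // Serialize to the byte[] you would store in one HBase cell.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(profile, encoder);
        encoder.flush();
        byte[] cellValue = out.toByteArray();
        System.out.println(cellValue.length + " bytes for the cell");
    }
}
```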

By making your row very wide, you'll increase seek times when fetching a subset of columns, because of how HFiles are stored. Again, determining what is optimal comes down to your access patterns.

I would also like to point out that by using something like Avro, you gain evolvability. You don't need to delete the row and re-add it with a record containing the new field: Avro has rules for backward and forward compatibility. This actually makes your life much, much easier, because you can read both new and old records WITHOUT rewriting your data or forcing updates to older client code.
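
As a sketch of what that evolvability looks like: a reader schema that adds 'hasIphone6' with a default can still decode bytes written with the old schema (schema and field names invented, continuing the example above):

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;

public class EvolveRead {
    public static GenericRecord readOldBytes(byte[] oldBytes) throws IOException {
        // The schema the data was originally written with.
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Profile\",\"fields\":["
          + "{\"name\":\"city\",\"type\":\"string\"}]}");
        // The schema current code expects, with the new field and a default.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Profile\",\"fields\":["
          + "{\"name\":\"city\",\"type\":\"string\"},"
          + "{\"name\":\"hasIphone6\",\"type\":\"boolean\",\"default\":false}]}");

        // Old bytes decode fine; 'hasIphone6' is filled in from its default.
        Decoder decoder = DecoderFactory.get().binaryDecoder(oldBytes, null);
        return new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
            .read(null, decoder);
    }
}
```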

You should nearly always use compression in HBase (SNAPPY is always a good choice).