12 votes

I'm working on a project that stores key/value information about users in HBase. We are in the process of redesigning the HBase schema we are using. The two options being discussed are:

  1. Use HBase column qualifiers as names for the keys. This would make rows wide, but very sparse.
  2. Dump all the data into a single column and serialize it using Avro or Thrift.
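
For concreteness, here is roughly what the two write paths look like with the standard HBase Java client. This is just a sketch: the table, column family, qualifier, and row key names are invented, and option 2 uses a placeholder where the Avro/Thrift-serialized bytes would go.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class TwoSchemas {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Option 1: one column qualifier per key -> wide, sparse rows.
            Put perKey = new Put(Bytes.toBytes("user-42"));
            perKey.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("city"), Bytes.toBytes("Austin"));
            perKey.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("zip"), Bytes.toBytes("78701"));
            table.put(perKey);

            // Option 2: one opaque cell holding the whole record.
            // Placeholder bytes; in real code this would be the Avro/Thrift output.
            byte[] serialized = Bytes.toBytes("<serialized-record>");
            Put blob = new Put(Bytes.toBytes("user-42"));
            blob.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("profile"), serialized);
            table.put(blob);
        }
    }
}
```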

What are the design tradeoffs of the two approaches? Is one preferable to the other? Are there any reasons not to store the data using Avro or Thrift?


2 Answers

12 votes

In summary, I lean towards using distinct columns per key.

1) Obviously, you are imposing a requirement that every client use Avro/Thrift, which is another dependency. That dependency can rule out certain tooling, such as BI tools that expect to find plain values in the data without transformation.

2) Under the Avro/Thrift scheme, you are pretty much forced to bring the entire value across the wire. Depending on how much data is in a row, this may not matter. But if you are only interested in the 'city' field/column qualifier, you still have to fetch 'payments', 'credit-card-info', etc. This may also pose a security issue.
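
With distinct columns, by contrast, you can ask HBase for exactly the one qualifier you need, and the other cells never leave the region server. A minimal sketch with the standard Java client; the table, family, and row key names are invented:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetOneColumn {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Ask for just the 'city' qualifier; 'payments' etc. stay server-side.
            Get get = new Get(Bytes.toBytes("user-42"));
            get.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("city"));
            Result result = table.get(get);
            String city = Bytes.toString(
                result.getValue(Bytes.toBytes("attrs"), Bytes.toBytes("city")));
            System.out.println(city);
        }
    }
}
```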

3) Updates, if required, will be more challenging with Avro/Thrift. Example: you decide to add a 'hasIphone6' key. With Avro/Thrift, you are forced to read the whole serialized value, add the field, and write the whole thing back. Under the column scheme, a new cell is simply appended, containing only the new column. For a single row this is no big deal, but do it to a billion rows and you have rewritten a lot of data and set up a big compaction operation.
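
Under the column scheme, the "update" is just a single one-cell Put (a sketch; names are invented, as above):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class AddNewKey {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Appends one new cell; the row's existing cells are untouched.
            Put put = new Put(Bytes.toBytes("user-42"));
            put.addColumn(Bytes.toBytes("attrs"),
                          Bytes.toBytes("hasIphone6"),
                          Bytes.toBytes("true"));
            table.put(put);
        }
    }
}
```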

4) If configured, you can use compression in HBase, which may beat Avro/Thrift serialization on size, since HBase can compress across a whole column family instead of just a single record.
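
As a sketch, enabling it per column family with the HBase 2.x admin API might look like this (assumes the Snappy native libraries are available on the region servers; table and family names are invented):

```java
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableCompression {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection();
             Admin admin = conn.getAdmin()) {
            // Compression is set per column family and applies across all its cells.
            ColumnFamilyDescriptor attrs = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("attrs"))
                .setCompressionType(Compression.Algorithm.SNAPPY)
                .build();
            admin.modifyColumnFamily(TableName.valueOf("users"), attrs);
        }
    }
}
```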

5) BigTable implementations like HBase do very well with very wide, sparse tables, so there won't be the performance hit you might expect.

5 votes

The right answer to this is a bit more complicated, so I'll give you the tl;dr first.

Use Avro/Thrift/Protobuf

You will need to strike a balance between how many fields you pack into a record and how many you keep as separate columns.

You'll typically want to put fields ("keys" in your original question) that are frequently accessed together into something like an Avro record, because, as cmonkey mentioned, you don't want the overhead of retrieving extra data you won't use.
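
For instance, packing two co-accessed fields into one Avro record and getting bytes you can store in a single HBase cell might look like this (a sketch; the schema and field names are invented):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class PackRecord {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Profile\",\"fields\":["
          + "{\"name\":\"city\",\"type\":\"string\"},"
          + "{\"name\":\"zip\",\"type\":\"string\"}]}");

        GenericRecord profile = new GenericData.Record(schema);
        profile.put("city", "Austin");
        profile.put("zip", "78701");

        // Serialize to the byte[] you would store in one HBase cell.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(profile, encoder);
        encoder.flush();
        byte[] cellValue = out.toByteArray();
        System.out.println(cellValue.length + " bytes for the cell");
    }
}
```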

By making your row very wide, you'll increase seek times when fetching a subset of columns, because of how HFiles are stored. Again, determining what is optimal comes down to your access patterns.

I would also like to point out that by using something like Avro, you gain evolvability. You don't need to delete the row and re-add it with a record containing the new field: Avro has rules for backward and forward compatibility. This actually makes your life much, much easier, because you can read both new and old records WITHOUT rewriting your data or forcing updates to older client code.
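
As a sketch of what that evolvability looks like: a reader schema that adds 'hasIphone6' with a default can still decode bytes written with the old schema (schema and field names invented, continuing the example above):

```java
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;

public class EvolveRead {
    public static GenericRecord readOldBytes(byte[] oldBytes) throws IOException {
        // The schema the data was originally written with.
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Profile\",\"fields\":["
          + "{\"name\":\"city\",\"type\":\"string\"}]}");
        // The schema current code expects, with the new field and a default.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Profile\",\"fields\":["
          + "{\"name\":\"city\",\"type\":\"string\"},"
          + "{\"name\":\"hasIphone6\",\"type\":\"boolean\",\"default\":false}]}");

        // Old bytes decode fine; 'hasIphone6' is filled in from its default.
        Decoder decoder = DecoderFactory.get().binaryDecoder(oldBytes, null);
        return new GenericDatumReader<GenericRecord>(writerSchema, readerSchema)
            .read(null, decoder);
    }
}
```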

You should nearly always use compression in HBase (SNAPPY is always a good choice).