5
votes

Because HBase tables are sparse, HBase stores for every cell not only the value but also all the information required to identify the cell (often described as the Key, not to be confused with the RowKey). The Key looks as follows:

RowKey-ColumnFamily-ColumnQualifier-Timestamp

And all this information is stored for every entry. That's why there is the recommendation to use short names for Column Families and Column Qualifiers to reduce additional overhead.
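To make the overhead concrete, here is a rough sketch that estimates the key size per cell. It assumes a simplified version of the KeyValue key layout (2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type); the function name `key_size` and the sample row, family, and qualifier names are made up for illustration.

```python
# Simplified model of the per-cell key size in bytes (an assumption, not
# an exact accounting of what HBase writes to disk).
# Assumed layout: row length (2) + row + family length (1) + family
# + qualifier + timestamp (8) + key type (1).
def key_size(row: bytes, family: bytes, qualifier: bytes) -> int:
    return 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1

# Same row key, long versus short family and qualifier names.
long_names = key_size(b"user#0001", b"profile_data", b"email_address")
short_names = key_size(b"user#0001", b"p", b"e")

print(long_names, short_names)  # 46 23
```

Under these assumptions the long names double the key size, and that cost is paid again for every cell.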

My Question: Why do I need to store the ColumnFamily for every entry? From my understanding every Store File belongs to exactly one Column Family. Wouldn't it be enough to store the Column Family name once per Store File? This would reduce overhead, arbitrary Column Family names could be used and we would still be able to identify the Column Family for every entry. What am I missing here?

2
Not only is the column family stored for every value, the rowkey is also stored for every value. - ytoledano

2 Answers

0
votes

I think the reason is probably just simplicity, plus the fact that the key structure maps directly to the RPC representation. Dropping the column family before writing and recreating it after reading would require more internal copying and translation. I'm guessing the performance trade-off is more significant than it sounds, but I don't know whether the HBase devs have tried this particular variation. I do know that if you are concerned about the space taken up by your column family and column names, you can turn on data block encoding to minimize the overhead. You could also check out the Kiji project, which handles shortening these names for you and provides translation layers for your code, so you can still use longer names without worrying about the cost.
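To give a feel for why data block encoding helps, here is a toy sketch of prefix encoding: each key is stored as the number of leading bytes it shares with the previous key, plus the remaining suffix. This only illustrates the general idea and is not HBase's actual encoder; the function names and the `/`-separated key strings are assumptions made up for the example.

```python
def shared_prefix_len(a: bytes, b: bytes) -> int:
    """Count how many leading bytes two keys have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_encode(keys):
    """Store each key as (bytes shared with previous key, remaining suffix)."""
    encoded, previous = [], b""
    for key in keys:
        shared = shared_prefix_len(previous, key)
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded

def prefix_decode(encoded):
    """Rebuild the full keys from the (shared, suffix) pairs."""
    keys, previous = [], b""
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys

# Hypothetical sorted keys; the repeated row and family bytes are not stored again.
keys = [b"row1/info/email", b"row1/info/name", b"row2/info/email"]
encoded = prefix_encode(keys)
print(encoded[1])  # (10, b'name')
```

Because keys are sorted, neighbouring cells usually share the row key and column family, so most of the repeated bytes collapse into a small count.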

1
votes

Like a relational database, tables in HBase consist of rows and columns. In HBase, the columns are grouped together in column families. This grouping is expressed logically as a layer in the map of maps. Column families are also expressed physically: each column family gets its own set of HFiles on disk. This physical isolation allows the underlying HFiles of one column family to be managed in isolation from the others. As far as compactions are concerned, the HFiles for each column family are managed independently.
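The logical "map of maps" can be sketched with nested dictionaries. This is only a model of the logical view, not of how HBase lays data out on disk; the helper names `put` and `get_latest` and the sample data are made up for illustration.

```python
from collections import defaultdict

# Logical view: row key -> column family -> column qualifier -> timestamp -> value
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(row, family, qualifier, timestamp, value):
    """Store one versioned cell in the nested map."""
    table[row][family][qualifier][timestamp] = value

def get_latest(row, family, qualifier):
    """Return the value with the highest timestamp for a cell."""
    versions = table[row][family][qualifier]
    return versions[max(versions)]

put("row1", "info", "email", 1, "old@example.com")
put("row1", "info", "email", 2, "new@example.com")
put("row1", "stats", "logins", 1, 7)

print(get_latest("row1", "info", "email"))  # new@example.com
print(sorted(table["row1"]))                # ['info', 'stats']
```

The column family is the second layer of the map, which is the layer that HBase also separates physically into its own set of HFiles.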