I want to use HBase as the database for my application. I have a table with multiple columns, and I need to decide how many column families to use: one, or more than one. If more than one, what are the advantages and disadvantages?
1 Answer
This is already documented in the official HBase guide. The relevant passages are:
- On the number of column families
HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing and compactions are done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed though the amount of data they carry is small. When many column families exist, the flushing and compaction interaction can make for a bunch of needless i/o loading (To be addressed by changing flushing and compaction to work on a per column family basis). For more information on compactions, see compaction.
Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other but usually not both at the one time.
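To make the flushing argument concrete, here is a toy model in plain Python. It is not HBase code: the threshold value, the `simulate` function and the family names are all made up for illustration, and it only assumes what the guide states, namely that a flush is triggered per region and writes out every column family that holds data.

```python
# Toy model, not HBase code: a per-region flush writes one file per
# non-empty column family, no matter which family filled the memstore.
FLUSH_THRESHOLD = 128  # hypothetical flush size, arbitrary units


def simulate(writes, flush_threshold=FLUSH_THRESHOLD):
    """writes: iterable of (family, size) pairs.

    Returns the list of flushed files, each as a (family, size) tuple.
    """
    memstore = {}  # family -> size currently buffered in memory
    files = []
    for family, size in writes:
        memstore[family] = memstore.get(family, 0) + size
        if sum(memstore.values()) >= flush_threshold:
            # The whole region flushes: every family with data gets a file.
            files.extend(sorted(memstore.items()))
            memstore.clear()
    return files


# "big" carries the bulk of the data, "small" only a trickle.
writes = [("big", 60), ("small", 1), ("big", 70),
          ("big", 64), ("small", 1), ("big", 64)]
files = simulate(writes)
small_files = [f for f in files if f[0] == "small"]
print(files)
print(len(small_files), "tiny files written for the small family")
```

In this model every flush caused by the large family also leaves behind a tiny file for the small one, and those tiny files are what later compactions have to clean up.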
33.1. Cardinality of ColumnFamilies
Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA’s data will likely be spread across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient.
A good example is an analytics table with Daily, Monthly, Yearly and Total column families, each with its own TTL (expiration) setting and with one column per date range (day, month, year...). These are different scopes, and when you query the table you usually fetch only one type of aggregation at a time, e.g. the daily stats for the last 30 days.
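A rough sketch of that analytics table, using the happybase Python client (which talks to HBase through its Thrift server). The table name, TTL values, the `daily:YYYYMMDD` column naming and the helper functions are assumptions made for illustration, not something prescribed by HBase.

```python
from datetime import date, timedelta

DAY = 24 * 60 * 60  # seconds

# Hypothetical TTLs in seconds; pick values that match your retention needs.
FAMILIES = {
    "daily": {"time_to_live": 90 * DAY},
    "monthly": {"time_to_live": 2 * 365 * DAY},
    "yearly": {"time_to_live": 10 * 365 * DAY},
    "total": {},  # no TTL option given for the totals
}


def daily_columns(today, days=30):
    """Column names 'daily:YYYYMMDD' for the last `days` days, oldest first."""
    return [
        ("daily:%s" % (today - timedelta(days=n)).strftime("%Y%m%d")).encode()
        for n in range(days - 1, -1, -1)
    ]


def create_table(host="localhost", name="analytics"):
    # Needs the happybase package and a running HBase Thrift server.
    import happybase
    connection = happybase.Connection(host)
    connection.create_table(name, FAMILIES)
    return connection.table(name)


def last_30_days(table, row_key, today):
    # Only 'daily' columns are requested, so the other families are not read.
    return table.row(row_key, columns=daily_columns(today))
```

With a Thrift server available, usage would look like `last_30_days(create_table(), b"page-42", date.today())`, where `b"page-42"` is a made-up row key. The point of the design is that the query names columns from a single family, which is the "column scoped" access pattern the guide recommends for multiple families.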
If you want to know more about schema design, take a look at the great Introduction to HBase schema design by Amandeep Khurana.