12
votes

There are probably a lot of similar questions, but they don't answer my scenario (or at least I can't see how they apply).

  • I have, let's say, a table in HBase with 4 column families. The main reason is that each column family has a very different VERSIONS attribute.

  • None of the columns in any family store large values (full texts, for example); values average about 1 KB (long identifiers, some short strings, integers and so on).

  • I need to access data in several ways: scan and get by column family, get all cells of a given row by version (specific version or a range), and last but not least: get the latest version of all columns of a given row.

So, in this scenario, what are the disadvantages of having 4 column families? Are reads less efficient because (when the row is not in memory) they operate on different store files?


4 Answers

13
votes

There is a limit to the number of column families in HBase. There is one MemStore (a write cache that holds new data before it is written to HFiles) per column family; when one is full, they all flush.

The more column families you add, the more MemStores are created and the more frequent MemStore flushes become, which degrades performance.
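To make the flush behaviour concrete, here is a minimal toy model in plain Python (not the HBase API). It assumes a flush is triggered once the cells buffered across all families reach a threshold, and that every family's MemStore is then written out as its own file. The function name `simulate_region_flushes` and the threshold are hypothetical, chosen only for illustration.

```python
from collections import Counter

def simulate_region_flushes(writes, flush_threshold):
    """Toy model of a region with one MemStore per column family.

    writes: iterable of column family names, one entry per written cell.
    Returns the flushed files as a list of (family, cell_count) tuples.
    """
    memstores = Counter()  # cells currently buffered per family
    files = []
    for family in writes:
        memstores[family] += 1
        # Assumption: the flush is region-wide, so every family is flushed
        # together, even the ones holding almost no data.
        if sum(memstores.values()) >= flush_threshold:
            files.extend(sorted(memstores.items()))
            memstores.clear()
    return files

# Same number of cells written, first with one family, then with four
# families where family "a" receives most of the writes.
one_family = simulate_region_flushes(["a"] * 1000, flush_threshold=100)
four_families = simulate_region_flushes(
    (["a"] * 97 + ["b", "c", "d"]) * 10, flush_threshold=100
)
tiny_files = [f for f in four_families if f[1] == 1]
```

In this sketch the single-family table ends up with 10 files of 100 cells each, while the four-family table ends up with 40 files, 30 of which hold a single cell. Those extra small files are what later has to be read and compacted.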

10
votes

The idea behind column families is great - unfortunately the current HBase implementation does not handle a lot of column families well. Basically you should try to stick with one and add a second only if you have radically different access patterns. Also see the HBase manual.

What you can do is keep your different "families" as columns with different prefixes inside a single family. HBase is sparse, so this won't take more space, and you can still fetch just one "family" with a ColumnPrefixFilter on scans if you need to.
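A minimal sketch of the prefix idea in plain Python (a stand-in for the real client, not HBase code): the row is modelled as a dict from column qualifier to value, and the hypothetical helper `columns_with_prefix` selects one logical "family" by qualifier prefix. The qualifier names are made up for illustration.

```python
def columns_with_prefix(row, prefix):
    """Return the cells of `row` whose qualifier starts with `prefix`,
    in sorted qualifier order, similar in spirit to a prefix filter."""
    return {q: v for q, v in sorted(row.items()) if q.startswith(prefix)}

# One physical column family; the logical grouping lives in the qualifier.
row = {
    "meta_created": "2013-01-01",
    "meta_owner": "alice",
    "stats_clicks": 42,
    "stats_views": 100,
}

stats_only = columns_with_prefix(row, "stats_")
```

In the HBase Java client the equivalent selection is done with `ColumnPrefixFilter`. One caveat for the scenario in the question: VERSIONS is configured per column family, so prefixed groups that share a family also share that setting.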

7
votes

According to the Apache HBase wiki, HBase runs into performance issues with more than 2 or 3 column families.

1
votes

When the MemStore accumulates enough data, the entire sorted set is written to a new HFile in HDFS. HBase uses multiple HFiles per column family, which contain the actual cells, or KeyValue instances. These files are created over time as KeyValue edits sorted in the MemStores are flushed as files to disk.
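Because a column family's cells are spread over the MemStore and several HFiles, a read has to consider all of them to find the newest version. The following is a rough sketch of that idea in plain Python; the data layout and the `read_latest` helper are hypothetical simplifications, and it ignores the indexes and Bloom filters that real HBase uses to skip files.

```python
def read_latest(stores, row, qualifier):
    """Return the newest (timestamp, value) for a cell, or None.

    stores: list of dicts mapping (row, qualifier) -> list of
    (timestamp, value) pairs; one dict stands for the MemStore or one HFile.
    Any store may hold the newest version, so all of them are checked.
    """
    candidates = []
    for store in stores:
        candidates.extend(store.get((row, qualifier), []))
    if not candidates:
        return None
    return max(candidates, key=lambda cell: cell[0])

# One column family: a MemStore plus two HFiles flushed earlier.
memstore = {("r1", "name"): [(30, "carol")]}
hfile_1 = {("r1", "name"): [(10, "alice")]}
hfile_2 = {("r1", "name"): [(20, "bob")], ("r2", "name"): [(5, "dave")]}

newest = read_latest([memstore, hfile_1, hfile_2], "r1", "name")
```

Fetching every column of a row that spans four families repeats this lookup once per family, since each family keeps its own MemStore and HFiles.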

Note that this is one reason why there is a limit to the number of column families in HBase. There is one MemStore per CF; when one is full, they all flush. The flush also saves the last written sequence number, so the system knows what has been persisted so far. The more column families you add, the more MemStores are created and the more frequent MemStore flushes become.