
I have two HBase tables: one with a single column family, and the other with four column families. Both tables are keyed by the same rowkey, and each column family has a single column qualifier whose value is a JSON string (each JSON payload is about 10-20K in size). All column families use fast-diff encoding and gzip compression.

After loading about 60MM rows into each table, a scan test on any single column family in the second table takes 4x as long as scanning the single column family in the first table. Note that the scan on the second table uses addFamily to limit the scan to one column family, and both tests scan exactly 1MM rows, so the net workload (and hence the expected performance) should be the same in both cases. Running a major compaction on both tables did not change the results much.
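For reference, the test is roughly equivalent to the sketch below. This is a minimal reconstruction, not the exact test code: it assumes the HBase 2.x Java client, and the table names (`table1`, `table2`), family names (`cf`, `cf1`), and the caching value are hypothetical placeholders. Turning off block caching for the scan is also an assumption, made here only to keep cache state out of the comparison.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleFamilyScanTimer {

    // Scans up to maxRows rows of one column family and returns elapsed milliseconds.
    static long timeScan(Connection conn, String tableName, String family, int maxRows)
            throws IOException {
        Scan scan = new Scan()
                .addFamily(Bytes.toBytes(family)) // restrict the scan to a single column family
                .setCaching(100)                  // rows per RPC; placeholder value, rows are 10-20K each
                .setCacheBlocks(false)            // assumption: keep the block cache out of the test
                .setLimit(maxRows);               // stop after maxRows rows (HBase 2.0+)

        long rows = 0;
        long start = System.nanoTime();
        try (Table table = conn.getTable(TableName.valueOf(tableName));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result ignored : scanner) {
                rows++; // only count rows; values are fetched but not parsed
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%s:%s scanned %d rows in %d ms%n", tableName, family, rows, elapsedMs);
        return elapsedMs;
    }

    public static void main(String[] args) throws IOException {
        // Reads hbase-site.xml from the classpath for cluster connection settings.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            timeScan(conn, "table1", "cf", 1_000_000);  // hypothetical single-family table
            timeScan(conn, "table2", "cf1", 1_000_000); // hypothetical four-family table
        }
    }
}
```

Both calls request the same number of rows from one family each, which is why I expected similar timings.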

Though the HBase docs and other tech forums recommend against using more than one column family per table, nothing I have read so far suggests that scan performance degrades linearly with the number of column families. Has anyone else experienced this, and is there a simple explanation for it?

Note that the second table has four column families because, although I only scan one column family at a time now, there are requirements to scan multiple column families from that table for a given set of rowkeys.

Thanks for any insight into the performance question.

You didn't tell us about the blockcache configuration in both cases, which I think is the key to answering your question. - mazaneicha
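One way to collect the per-family settings the comment asks about is to print them from the table descriptors. The sketch below assumes the HBase 2.x Java client and uses hypothetical table names (`table1`, `table2`). It only covers settings stored on the column family; the RegionServer-wide cache size (`hfile.block.cache.size` in hbase-site.xml) has to be checked separately.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;

public class BlockCacheSettings {
    public static void main(String[] args) throws IOException {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Hypothetical table names; replace with the two tables being compared.
            for (String name : new String[] {"table1", "table2"}) {
                TableDescriptor td = admin.getDescriptor(TableName.valueOf(name));
                for (ColumnFamilyDescriptor cf : td.getColumnFamilies()) {
                    // Print the per-family settings that affect how blocks are read and cached.
                    System.out.printf(
                            "%s:%s blockcache=%b blocksize=%d encoding=%s compression=%s%n",
                            name,
                            cf.getNameAsString(),
                            cf.isBlockCacheEnabled(),
                            cf.getBlocksize(),
                            cf.getDataBlockEncoding(),
                            cf.getCompressionType());
                }
            }
        }
    }
}
```

If the output differs between the two tables, that difference is worth ruling out before attributing the slowdown to the number of column families.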

1 Answer


That's normal behavior, if I've understood your situation correctly. Since each column family is backed by a separate Store on the RegionServer, accessing multiple stores takes more time.

You can limit your scan to specific column families by calling addFamily on your Scan object.