3
votes

I have a table with one column family called 'A'. At runtime, I insert (Key, Value) pairs into the table. Setting the row key aside, my design uses MD5(Key) as the column qualifier, so column qualifiers are created dynamically, and each cell contains the corresponding Value.

E.g.: Each car has a license plate, and I want to insert all cars into one HBase table. Car A has row key R1 and column qualifier C1, and the value is the license plate of A. Car B has row key R2 and column qualifier C2, and the value is the license plate of B, and so on. With this schema, when I execute a Scan with row key = R1, is the cell for column qualifier C2 returned (in this case it would definitely be null)?

I want to ask some questions about performance:

  1. With this schema design, does the Scan command's performance decrease? (I want to scan all values in the table.) For each row, will all columns be returned?

  2. Given the above requirements, can anyone point me to the right way to design this table?

Thank you very much!


4 Answers

2
votes

No, scan performance will not decrease. That is the beauty of HBase.

I have dealt with a similar structure and a huge data set, and retrieval was amazingly quick.

I think the various HBase filters would help a lot in a scenario like this.

You can also read about HBase filters in HBase: The Definitive Guide. One of the useful filters is the prefix filter. If you are working in Java, it would look something like this:

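// i is the car identifier (license number or car number) used in the row key
Scan s = new Scan();
// Keep only rows whose key starts with "car_" + i
Filter filter = new PrefixFilter(Bytes.toBytes("car_" + i));
s.setFilter(filter);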

Here the row keys for the different cars can be "car_[license number OR car number]", so even if you want to extract only one row out of hundreds of thousands of rows, it can be done in a few seconds.
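To illustrate why a key prefix is cheap to look up when rows are kept sorted by key, here is a minimal sketch that models the table as a plain java.util.TreeMap. It is not the HBase client API: the class name, the helper methods and the sample keys are all made up for illustration, and it compares Java strings rather than raw bytes, which only matches HBase ordering for ASCII keys. The idea is to turn the prefix into a start key and a stop key so that only the matching range is visited. As far as I know, a PrefixFilter on its own does not set a start row, so in real code it is usually worth bounding the scan range as well.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of a prefix range scan over sorted row keys (not the HBase API).
class PrefixScanSketch {
    // Smallest key that sorts after every key starting with the prefix.
    // Assumes a non-empty prefix whose last char is not Character.MAX_VALUE.
    static String stopKey(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // Returns only the rows in the range [prefix, stopKey(prefix)).
    static NavigableMap<String, String> prefixScan(TreeMap<String, String> table, String prefix) {
        return table.subMap(prefix, true, stopKey(prefix), false);
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("car_1001", "plate-A"); // hypothetical sample rows
        table.put("car_1002", "plate-B");
        table.put("car_2001", "plate-C");
        table.put("truck_1", "plate-D");
        // Only keys starting with "car_1" fall inside the range.
        System.out.println(prefixScan(table, "car_1").keySet());
    }
}
```

The sorted map jumps straight to the first matching key and stops at the first key past the prefix, which is the same property that makes row-key prefixes attractive in HBase.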

2
votes

Having many fine-grained cells can sometimes be your enemy, since the row key, family and qualifier (which combine to make the actual "key") can be heavily duplicated. This increases your data's space footprint, which in turn affects access speed.
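A rough back-of-the-envelope sketch of that footprint, assuming the classic KeyValue layout (length fields, row, family, qualifier, timestamp, type, value) and ignoring tags, compression and block encoding. The sizes passed in main are hypothetical, chosen only to resemble the question's MD5 qualifiers:

```java
// Estimates the serialized size of one cell; a sketch, not an HBase API.
class CellOverheadSketch {
    // Fixed bytes per cell: key length (4) + value length (4) + row length (2)
    // + family length (1) + timestamp (8) + key type (1).
    static final int FIXED_BYTES = 4 + 4 + 2 + 1 + 8 + 1;

    static int cellBytes(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return FIXED_BYTES + rowLen + familyLen + qualifierLen + valueLen;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: 8-byte row key, family "A" (1 byte),
        // 16-byte MD5 qualifier, 8-byte license plate value.
        int valueLen = 8;
        int total = cellBytes(8, 1, 16, valueLen);
        System.out.println("bytes per cell: " + total);
        System.out.println("value share: " + (valueLen * 100 / total) + "%");
    }
}
```

With small values, most of each stored cell is key material, which is why the number and size of qualifiers matter.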

If this problem applies to you, you could consider merging logical cells together into larger, physical "multi-cells" in a few different ways:

  • By packing sibling fields into "structs", the way you might combine field members into a class
  • By joining cells that have a common qualifier prefix (say, the first half of each MD5). This is especially applicable if prefix similarity implies access locality.
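As a minimal sketch of the first option, the snippet below packs a few sibling fields into a single byte array that could be stored as one cell value instead of several cells. The field choice (plate, make, year), the class name and the length-prefixed encoding are assumptions made for illustration; in practice a serialization library could play the same role.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Packs sibling fields into one "struct" value (hypothetical layout).
class CarStructSketch {
    // Layout: [short len][plate bytes][short len][make bytes][int year]
    static byte[] pack(String plate, String make, int year) {
        byte[] p = plate.getBytes(StandardCharsets.UTF_8);
        byte[] m = make.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(2 + p.length + 2 + m.length + 4);
        buf.putShort((short) p.length).put(p);
        buf.putShort((short) m.length).put(m);
        buf.putInt(year);
        return buf.array();
    }

    // Reads the fields back in the same order they were written.
    static String[] unpack(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        byte[] p = new byte[buf.getShort()];
        buf.get(p);
        byte[] m = new byte[buf.getShort()];
        buf.get(m);
        int year = buf.getInt();
        return new String[] {
            new String(p, StandardCharsets.UTF_8),
            new String(m, StandardCharsets.UTF_8),
            Integer.toString(year)
        };
    }

    public static void main(String[] args) {
        byte[] value = pack("ABC-123", "Toyota", 2012); // hypothetical sample data
        System.out.println(String.join(", ", unpack(value)));
    }
}
```

The trade-off is that a packed value has to be read and rewritten as a whole, so this fits fields that are usually accessed together.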

There is an OpenTSDB slide deck that discusses how it incorporates similar ideas.

Note that newer versions of HBase may allow you to use a trie-based data block encoding. This data structure would naturally help eliminate on-disk prefix redundancy, relieving the need for these kinds of schema tricks. See HBASE-4676 and HBASE-7162.

1
votes

HBase stores data in a sparse format. Every cell is stored as 'Row Key, Column Family, Column Qualifier, Version, Value'. Scans over the table only produce column qualifiers for which there are values. Even though your design specifies column qualifiers that are essentially unique across your entire table, during a scan over the table each row will produce exactly one value (according to your description), and no extraneous null values will be returned for column qualifiers that are only defined on another row.
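The sparse behaviour can be pictured with a toy in-memory model, where a row is simply a map holding the qualifiers that were actually written. This is only a sketch built on nested TreeMaps, not the HBase client API, and the class name, method names and sample values are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of sparse storage: only cells that were written exist.
class SparseTableSketch {
    // row key -> (column qualifier -> value)
    private final TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();

    void put(String row, String qualifier, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(qualifier, value);
    }

    // Returns the cells stored for one row as "qualifier=value", in qualifier order.
    List<String> scanRow(String row) {
        List<String> cells = new ArrayList<>();
        for (Map.Entry<String, String> e : rows.getOrDefault(row, new TreeMap<>()).entrySet()) {
            cells.add(e.getKey() + "=" + e.getValue());
        }
        return cells;
    }

    public static void main(String[] args) {
        SparseTableSketch table = new SparseTableSketch();
        table.put("R1", "C1", "plate-A");
        table.put("R2", "C2", "plate-B");
        // R1 yields only its own C1 cell; there is no placeholder for C2.
        System.out.println(table.scanRow("R1"));
        System.out.println(table.scanRow("R2"));
    }
}
```

Nothing is stored or returned for a qualifier that a row never wrote, which is the behaviour described for the R1/C2 case in the question.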

You have described a design for the table already. You can implement it without any further issues. A design question needs to be phrased in terms of the use cases to understand whether you have chosen a design that will perform well.

0
votes

I want to ask some questions about performance:

  1. With this schema design, does the Scan command's performance decrease? (I want to scan all values in the table.) For each row, will all columns be returned?

  2. Given the above requirements, can anyone point me to the right way to design this table?

  1. No. Only the columns that have been added for that particular row key are returned.
  2. Can you explain why you need dynamically created qualifiers? I suggest using the same qualifier name for all row keys. For example, you can have a column family 'car-info' and a qualifier 'license-plate', as well as 'make', 'model', 'year', etc.