3
votes

In my HBase table, each row may be have different columns than other rows. For example;

ROW                       COLUMN
1-1040                    cf:s1
1-1040                    cf:s2
1-1043                    cf:s2
2-1040                    cf:s5
2-1045                    cf:s99
3-1040                    cf:s75
3-1042                    cf:s135

As seen above, each row has different columns than other rows. So, when I run scan query like this;

scan 'tb', {COLUMNS=>'cf:s2', STARTROW=>'1-1040', ENDROW=>'1-1044'}

I want to get cf:s2 values using above query. But, does any performance issue occur due to each row has different columns?

Another option;

ROW                       COLUMN
1-1040-s1                 cf:value
1-1040-s2                 cf:value
1-1043-s2                 cf:value
2-1040-s5                 cf:value
2-1045-s99                cf:value
3-1040-s75                cf:value
3-1042-s135               cf:value

In this option, when I want to get s2 values between 1-1040 and 1-1044, I am running this query for this;

scan 'tb', {STARTROW=>'1-1040s2', ENDROW=>'1-1044', FILTER=>"RowFilter(=, 'substring:s2')"}

When I want to get s2 values, which option is better in read performance?

1

1 Answers

4
votes

HBase stores all records for a given column family in the same file, and so the scan has to run over all key-value pairs, even if you apply a filter. This is true of both ways you suggest for storing the data.

For optimal performance of this specific scan, you should consider storing your s2 data in a different column family. Under-the-hood, HBase will store your data in the following way:

One file:

1-1040                    cf1:s1
2-1040                    cf1:s5
2-1045                    cf1:s99
3-1040                    cf1:s75
3-1042                    cf1:s135

Another file:

1-1040                    cf2:s2
1-1043                    cf2:s2

Then you can run a scan over just cf2, and HBase will only read data containing s2, making the operation much faster.

scan 'tb', {COLUMNS => 'cf2', STARTROW=>'1-1040s2', ENDROW=>'1-1044'}

Considerations:

  • It's recommended to only have two or three column families per table, so you shouldn't implement this if you want to run this query for s5, s75 etc. In this case, your composite rowkey option is better as HBase only need look at the rowkey, and not column qualifiers.
  • It depends on exactly which queries you'll be running, and how often you'll be running them. This is the fastest way for you to get values associated with s2, but might not be fastest for other queries.