0
votes

I would like to have a recommendation on designing the Hbase table/RowKey for efficient search

Here, is a sample data set:

|   Column Families |   Column Qualifiers   |   Row 1               |   Row 2           |   Row 3       |
---------------------------------------------------------------------------------------------------------
|   Country         |           Code        |   US                  |                   |   UK          |
|
|                   |           Full Name   |United States of America|                  |United Kingdom |
|                   |           Capital     |Washington, DC         |                   |London         |
|                   |           Leader      | President             |   President       |Prime minister |

|State              |                       |   Texas               |                   |               |
|District           |                       |Houston                |LA                 |               |
|County             |                       |Harris                 |Harris             |   Cambridge   |
|City               |                       |Houston city           |Duke City          |               |
|Road               |                       |Bellare                |                   |Downing Street |
|Family             |                       |Doe                    |Wade               |               |

|Person             |       Name            |John Doe               |Doo                |Smith          |
|                   |       Location        |35.00 N, 99.00 W       |31.00 N , 100.00 W |               |
|                   |       Gender          |                       |Female             |Male           |
|                   |       Religion        |Atheist                |Maya               |Christian      |

note: In the above sample data set the column qualifiers are detailed only for the first and the last column family to avoid data cluttering.

Data overview: Its a hierarchical data set but the information at any level can be missing Example, Row 2 doesn't have the Country information Row 3, UK doesn't have the concept of State, so the State hierarchy will be missing for UK records

The requirement is to search for following scenarios:

|Search Criteria                    |Returned Records           |
-----------------------------------------------------------------
|All records                        |3 count, Row1, Row2 & Row3 |
|Country = USA                      |1 Record, Row 1            |
|Gender = Male                      |1 Record, Row 3            |
|County = Harris                    |2 records, Row 1 & Row 2   |
|LATITUDE > 30 and LONGITUDE < 101  |2 records, Row 1 & Row 2   |
|All Atheist for USA                |1 record, Row1             |

Proposed design solution:

1. Create eight column families for each level since there can be added information which needs to be searched at each level (example, the leaders name, position, timezone, area for level city, county, district, country etc.)

|Level one      |Country    |
|Level two      |State      |
|Level three    |District   |
|Level four     |County     |
|Level Five     |City       |
|Level Six      |Road       |
|Level Seven    |Family     |
|Level Eight    |Person     |
  1. Design the Rowkey as a composite key, a combination of all level codes since the search can be at any level Level1Code:Level2Code:Level3Code:Level4Code:Level5Code:Level6Code:Level7Code:Level8Code: example, US:101:102:103:104:105:106:107

Other alternative I am thinking is to create a secondary index but will have to explore it further on performance end since I plan to use Hbase as a backend for my web application.

Thanks in advance for sharing your expertise!

1

1 Answers

1
votes

About the number of column families I'd stick to the lower number of families as possible (HBase docs). Given your case I'd try with 2 families: one for location data, and another one for the person-related data, that way you can search by location without reading anything about the person or vice-versa. Just remind that if you scan only one family to speed up things, you can only retrieve the data of that family, so in this case if you look for people within US you won't be able to get their names unless you also add the other family, and if that's the case, there's no gain by having multiple families. Multiple families are used if your data need different configurations (TTL, versions, compression...) or if they can be queried independently without needing data from other families.

In the end it all depends on your most common queries and when you need to reduce the amount of data read: let's say you perform a lot of queries based on the geolocation, in that case I'd move Lat & Long columns to their own family to avoid reading anything else.

About the composite rowkeys: I'm not very sure about where the your LevelCode ids come from, are they normalized values stored elshewere?. Your rowkeys would seem pretty large if you store them as a string within your rowkey (any char will take 1 byte, if you have large LevelCode ids you'll end with +60 byte rowkeys), please notice that in HBase every cell includes the row key + column family + column + timestamp + value, so, it will be a pretty overhead in terms of storage with not much performance gain when querying.

If your LevelCode ids are normalized integers and you want to use them to filter data perhaps you should consider adding an ids family (with the values as integer columns), that way you could scan just that family and filter out what you don't need with a SingleColumnValueFilter.

If you have some spare time take a look to the great Introduction to HBase schema design by Amandeep Khurana.


About real-time answers

If you need to provide real-time answers to your queries both your approaches won't work very well given a good amount of data. HBase is a data storage, not a search engine, it can run searches and analysis jobs in huge datasets but they won't be real-time.

With a proper design, HBase can work as a real-time backend for simple searches, but you should follow these mantras when designing tables:

  • Avoid full scans at all costs, even if they have filters. Full scan = Read every row = Not suitable for real-time.
  • Write is fast. Denormalize and write as much data as you need in order to speed up data retrieval.

This means you need secondary indexes for each query you expect to run if you need to retrieve them in real-time. All of your scan queries should have at least a start row key, and that requires writing the same data to multiple tables under different rowkeys (or even to the same table by using prefixes for each index type, a practice I wouldn't recommend, it makes splitting very hard and could have hotspotting issues).

Please notice that some of your queries combine multiple fields ("All Atheist for USA" or "LATITUDE > 30 and LONGITUDE < 101"), in those cases you'll also need a secondary index for them and for every combination of them, which will make things a lot more complex if you want to handle them with HBase.

I tend not to recommend switching to other systems because usually things can be done with more or less effort, but, based on your use case, I think it would be better to opt for a search engine that takes care of the indexing by itself.

Perhaps you'll find elasticsearch useful for your task: is fast, easy to learn, flexible, reliable, scalable and has a nice API in a lot of languages.