6
votes

I have a requirement to store events generated by a user identified by userId. Each user belongs to a company, which is identified by companyId. I have come up with a design for a table in HBase as follows:

rowkey: <companyId><userId><timestamp>

column-family: info (encapsulating set of event attributes as shown below)

columns: <attr1>, <attr2>....<attrn>
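
To make the layout concrete, here is a rough sketch of a single event write with this schema using the HBase Java client (the "events" table name, the fixed-width long encoding of the ids, and the attribute values are just placeholders I picked for illustration):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EventWriter {
        // Composite rowkey: fixed-width companyId + fixed-width userId + timestamp.
        // Fixed-width parts keep partial key scans on <companyId> or <companyId><userId> possible.
        static byte[] rowkey(long companyId, long userId, long timestamp) {
            return Bytes.add(Bytes.toBytes(companyId),
                             Bytes.toBytes(userId),
                             Bytes.toBytes(timestamp));
        }

        public static void main(String[] args) throws Exception {
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("events"))) {
                Put put = new Put(rowkey(42L, 1001L, System.currentTimeMillis()));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("attr1"), Bytes.toBytes("click"));
                put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("attr2"), Bytes.toBytes("homepage"));
                table.put(put);
            }
        }
    }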

I know that this key design will facilitate querying data later on companyId and/or userId using partial key scans. Having said that, I have some questions and concerns and wanted to get some ideas.

1- If we have a read use case that reads all data in a given time range, then with the current design we will not be able to use the rowkey. Instead we will have to do a full scan and filter rows on the timestamp field (maintained separately as one of the attr columns). Am I totally off-base here?
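
For what it's worth, a minimal sketch of what I mean by that fallback, assuming the timestamp is stored as a big-endian long in an info:ts column (the column name, encoding, and class/method names are just placeholders):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.filter.BinaryComparator;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    class TimeRangeFullScan {
        // Full-table scan filtered on a timestamp kept as an ordinary column ("info:ts").
        // Region servers still read every row; the filter only trims what is returned.
        static void scan(Table table, long startMillis, long endMillis) throws IOException {
            FilterList range = new FilterList(FilterList.Operator.MUST_PASS_ALL);
            range.addFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("info"), Bytes.toBytes("ts"),
                    CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes(startMillis))));
            range.addFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("info"), Bytes.toBytes("ts"),
                    CompareOp.LESS, new BinaryComparator(Bytes.toBytes(endMillis))));
            Scan scan = new Scan();
            scan.setFilter(range);
            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    // process events with ts in [startMillis, endMillis)
                }
            }
        }
    }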

2- How to handle duplicates? I know HBase will in that case create new versions of the cells, but will it still allow reading later according to the read use case mentioned in 1? I know you can control versions when you query, but would that be good design, or would it be misusing a native feature?

3- This concerns region server hotspotting. We don't have monotonically increasing keys, but we can still run into this issue if, say, one specific company or user is very active. Would hashing and bucketing based on the number of servers not work in this case? Maybe we could hash the timestamp field and append that to the rowkey rather than the original value? But then scanning on the timestamp component of the key would not be possible, and we would have to keep a separate column (attr) to capture it. Any suggestions?

Thanks a lot for any input (comment, link, book, idea) that can be provided.


1 Answer

3
votes

1: Read use case

It depends on your use case:

  • If you wish to fetch every user's data for an org in a given time range, then what you have seems correct to me, and you'll have to run a scan over all of the org's data.

  • If you wish to read all data for a given user, your current key design works, although I would flip the org and user id positions, making the new key (rowkey: userId-companyId-timestamp). Since the data of independent users is disjoint, it need not be coupled together.

  • If you push the timestamp to the front (rowkey: timestamp-companyId-userId), you can run a scan over all orgs' / all users' data bounded by the time range, skipping a full table scan; see the sketch below.
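
For example, a rough sketch of such a bounded scan with the Java client, assuming the timestamp is encoded as a fixed-width big-endian long at the front of the key (the method and table handles are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    class TimeRangeKeyScan {
        // With rowkey = <timestamp><companyId><userId>, a time-range query becomes a
        // bounded range scan instead of a full table scan plus filter.
        static void scan(Table table, long startMillis, long endMillis) throws IOException {
            Scan scan = new Scan()
                    .setStartRow(Bytes.toBytes(startMillis))  // inclusive lower bound
                    .setStopRow(Bytes.toBytes(endMillis));     // exclusive upper bound
            try (ResultScanner results = table.getScanner(scan)) {
                for (Result r : results) {
                    // rows for all orgs/users whose event time falls in [startMillis, endMillis)
                }
            }
        }
    }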

2: Duplication

BEWARE: HBase by default keeps up to 3 versions of a cell (also, do not confuse these version timestamps with the timestamp in your rowkey). You can increase this limit and fetch results from older versions as well; however, it is not recommended to set this version count very high.

If you are going to overwrite previously saved values, I would recommend not relying on looking up the previously saved version (although there are ways of achieving this). Alternatively, you could use a new column to store the new value if you must be able to save/fetch all previously recorded data.
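
A rough sketch of both options against the info family; treat the version count and the attr1_<writeTime> qualifier scheme as assumptions for illustration, not recommendations:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    class VersioningOptions {
        // Option A: raise the version count on the family and read several versions back.
        static void createWithVersions(Admin admin) throws IOException {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("events"));
            desc.addFamily(new HColumnDescriptor("info").setMaxVersions(5)); // keep 5 versions per cell
            admin.createTable(desc);
        }

        static Result readVersions(Table table, byte[] rowkey) throws IOException {
            Get get = new Get(rowkey);
            get.setMaxVersions(5); // return up to 5 versions of each cell
            return table.get(get);
        }

        // Option B: sidestep versioning by writing each value under a fresh qualifier,
        // e.g. "attr1_<writeTime>", so every write stays visible as an ordinary column.
        static void writeAsNewColumn(Table table, byte[] rowkey, String value) throws IOException {
            Put put = new Put(rowkey);
            put.addColumn(Bytes.toBytes("info"),
                          Bytes.toBytes("attr1_" + System.currentTimeMillis()),
                          Bytes.toBytes(value));
            table.put(put);
        }
    }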

3: Hot regions

  • IF a company is very active, you could prefix your rowkey with a hash of companyId-userId. This would distribute the writes for any single org across regions.

  • IF a user is very active and there is a use case to fetch all of their data back optimally, then I'm not sure hashing over the key or timestamp is a good solution. You would definitely want to keep that user's data together, and I'm not sure what a better solution would be here.

Based on how I understand your problem, I would probably design the rowkey as HASH(companyId-userId)-companyId-userId-timestamp.
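
To make that concrete, one way such a key could be assembled (the single salt byte and the bucket count are assumptions; any stable hash over companyId-userId would do):

    import org.apache.hadoop.hbase.util.Bytes;

    class SaltedKey {
        private static final int BUCKETS = 16; // assumption: number of salt buckets

        // Rowkey layout: HASH(companyId-userId) | companyId | userId | timestamp.
        // The hash prefix spreads a busy org's users over several regions while
        // keeping a single user's events contiguous and scannable.
        static byte[] rowkey(long companyId, long userId, long timestamp) {
            String salted = companyId + "-" + userId;
            byte saltBucket = (byte) ((salted.hashCode() & 0x7fffffff) % BUCKETS);
            return Bytes.add(
                    new byte[] { saltBucket },
                    Bytes.add(Bytes.toBytes(companyId), Bytes.toBytes(userId), Bytes.toBytes(timestamp)));
        }
    }

Since the salt is derived deterministically from companyId-userId, a read for a specific user can recompute it and still do a partial key scan on HASH-companyId-userId.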