I have a requirement to store events generated by users, each identified by a userId. Each user belongs to a company, which is identified by a companyId. I have come up with the following design for an HBase table:
rowkey: <companyId><userId><timestamp>
column-family: info (encapsulating set of event attributes as shown below)
columns: <attr1>, <attr2>....<attrn>
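
To make the layout concrete, here is a minimal sketch of how such a composite key could be built. It assumes (my assumption, nothing above requires it) that companyId and userId are numeric and that the timestamp is in epoch milliseconds, each encoded as a fixed-width big-endian 64-bit value so that byte order matches numeric order. The function names are hypothetical.

```python
import struct

def make_rowkey(company_id: int, user_id: int, timestamp_ms: int) -> bytes:
    # Three unsigned 64-bit big-endian fields: bytewise comparison of the
    # resulting keys matches (company, user, time) ordering.
    return struct.pack(">QQQ", company_id, user_id, timestamp_ms)

def company_prefix(company_id: int) -> bytes:
    # Prefix shared by every row of one company (partial key scan).
    return struct.pack(">Q", company_id)

def user_prefix(company_id: int, user_id: int) -> bytes:
    # Prefix shared by every row of one user within a company.
    return struct.pack(">QQ", company_id, user_id)

key = make_rowkey(7, 42, 1_700_000_000_000)
print(len(key), key.startswith(user_prefix(7, 42)))
```

With fixed-width fields the prefixes are unambiguous; with variable-length string IDs a delimiter or padding would be needed instead.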
I know that this key design will make it easy to query data later by companyId, or by companyId and userId, using partial key scans. That said, I have some questions and concerns and would like to get some ideas.
1- If we have a read use case that reads all data in a given time range, then with the current design we cannot use the rowkey, because the timestamp is its last component. Instead we would have to do a full scan and filter rows on the timestamp (maintained separately as one of the attr columns). Am I totally off-base here?
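
To illustrate the concern in 1, here is a small simulation using a sorted Python list in place of the table. It assumes the same hypothetical fixed-width numeric encoding (64-bit big-endian fields), which is my assumption and not part of the design above. A time range for one known user maps to a contiguous key range, while a time range across all users matches rows scattered over the whole key space.

```python
import struct
from bisect import bisect_left

def rowkey(company_id: int, user_id: int, timestamp_ms: int) -> bytes:
    return struct.pack(">QQQ", company_id, user_id, timestamp_ms)

def scan_bounds(company_id, user_id, start_ms, end_ms):
    # [start_row, stop_row) covering one user's events in [start_ms, end_ms).
    return rowkey(company_id, user_id, start_ms), rowkey(company_id, user_id, end_ms)

def range_scan(sorted_keys, start_row, stop_row):
    # Stand-in for a scan with a start row (inclusive) and stop row (exclusive).
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

# Toy table: 2 companies x 2 users x 3 timestamps, kept in sorted key order.
keys = sorted(rowkey(c, u, t) for c in (1, 2) for u in (10, 11) for t in (100, 200, 300))

# Time range for a known company and user: one contiguous range scan.
start, stop = scan_bounds(1, 10, 150, 301)
hits = range_scan(keys, start, stop)

# Time range across all users: every key has to be inspected and filtered.
global_hits = [k for k in keys if 150 <= struct.unpack(">QQQ", k)[2] < 301]

print(len(hits), len(global_hits))
```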
2- How should duplicates be handled? I know that in that case HBase will create a new version of the cells in that row, but will that still allow reading later on according to the read use case mentioned in 1? I know you can control the number of versions returned when you query, but would that be a good design, or would it be misusing a native feature?
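
One alternative to relying on versions, sketched here purely as an assumption on my part, would be to append a short digest of the event payload to the key. An exact duplicate then maps to the same rowkey, while two different events in the same millisecond get different rowkeys. The 4-byte SHA-1 suffix, the numeric ID encoding, and the function name are all hypothetical choices.

```python
import hashlib
import struct

def rowkey_with_digest(company_id: int, user_id: int, timestamp_ms: int, payload: bytes) -> bytes:
    # First 4 bytes of SHA-1 over the serialized event act as a tie-breaker.
    digest = hashlib.sha1(payload).digest()[:4]
    return struct.pack(">QQQ", company_id, user_id, timestamp_ms) + digest

a = rowkey_with_digest(7, 42, 1000, b"login")
b = rowkey_with_digest(7, 42, 1000, b"login")   # exact duplicate of a
c = rowkey_with_digest(7, 42, 1000, b"logout")  # different event, same millisecond

print(a == b, a == c)
```

The first 24 bytes are unchanged, so partial key scans on companyId and userId keep working.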
3- This concerns region server hotspotting. Our keys are not monotonically increasing, but we can still run into this issue if, say, one specific company or user is very active. Will hashing and bucketing based on the number of servers not work in this case? Maybe we could hash the timestamp field and put the hash in the rowkey rather than the original value? But then scanning on the timestamp component of the key would no longer be possible, and we would need a separate column (attr) in the column family to capture the timestamp. Any suggestions?
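
For reference, this is roughly the bucketing variant I have in mind, as a minimal sketch. It keeps the original timestamp in the key and only prepends a one-byte bucket derived from a hash of the timestamp, so writes from one busy user spread over several key ranges. The bucket count, the use of CRC32, the numeric ID encoding, and the function names are all assumptions. The cost is that a time-range read for one user becomes one scan per bucket.

```python
import struct
import zlib

NUM_BUCKETS = 8  # hypothetical; would be chosen relative to the number of region servers

def salt_for(timestamp_ms: int) -> int:
    # Stable hash of the timestamp, reduced to a bucket number.
    return zlib.crc32(struct.pack(">Q", timestamp_ms)) % NUM_BUCKETS

def salted_rowkey(company_id: int, user_id: int, timestamp_ms: int) -> bytes:
    # <bucket><companyId><userId><timestamp>
    return bytes([salt_for(timestamp_ms)]) + struct.pack(">QQQ", company_id, user_id, timestamp_ms)

def salted_scan_bounds(company_id: int, user_id: int, start_ms: int, end_ms: int):
    # One [start_row, stop_row) pair per bucket; results must be merged client-side.
    return [
        (bytes([b]) + struct.pack(">QQQ", company_id, user_id, start_ms),
         bytes([b]) + struct.pack(">QQQ", company_id, user_id, end_ms))
        for b in range(NUM_BUCKETS)
    ]

key = salted_rowkey(7, 42, 1500)
bounds = salted_scan_bounds(7, 42, 1000, 2000)
print(key[0], len(bounds))
```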
Thanks a lot for any input (comment, link, book, idea) that can be provided.