I have an HBase schema-design related question. The problem is fairly simple - I am storing "notifications" in hbase, each of which has a status ("new", "seen", and "read"). Here are the API's I need to provide:
- Get all notifications for a user
- Get all "new" notifications for a user
- Get the count of all "new" notifications for a user
- Update status for a notification
- Update status for all of a user's notifications
- Get all "new" notifications accross the database
- Notifications should be scannable in reverse chronological order and allow pagination.
I have a few ideas, and I wanted to see if one of them is clearly best, or if I have missed a good strategy entirely. Common to all three, I think having one row per notification and having the user id in the rowkey is the way to go. To get chronological ordering for pagination, I need to have a reverse timestamp in there, too. I'd like to keep all notifs in one table (so I don't have to merge sort for the "get all notificatiosn for a user" call) and don't want to write batch jobs for secondary index tables (since updates to the count and status should be in real time).
The simplest way to do it would be (1) row key is "userId_reverseTimestamp" and do filtering for status on the client side. This seems naive, since we will be sending lots of unecessary data through the network.
The next possibility is to (2) encode the status into the rowkey as well, so either "userId_reverseTimestamp_status" and then doing rowkey regex filtering on the scans. The first issue I see is needing to delete a row and copy the notification data to a new row when status changes (which presumably, should happen exactly twice per notification). Also, since the status is the last part of the rowkey, for each user, we will be scanning lots of extra rows. Is this a big performance hit? Finally, in order to change status, I will need to know what the previous status was (to build the row key) or else I will need to do another scan.
The last idea I had is to (3) have two column families, one for the static notif data, and one as a flag for the status, i.e. "s:read" or "s:new" with 's' as the cf and the status as the qualifier. There would be exactly one per row, and I can do a MultipleColumnPrefixFilter or SkipFilter w/ ColumnPrefixFilter against that cf. Here too, I would have to delete and create columns on status change, but it should be much more lightweight than copying whole rows. My only concern is the warning in the HBase book that HBase doesn't do well with "more than 2 or 3 column families" - perhaps if the system needs to be extended with more querying capabilities, the multi-cf strategy won't scale.
So (1) seems like it would have too much network overhead. (2) seems like it would have wasted cost spent copying data and (3) might cause issues with too many families. Between (2) and (3), which type of filter should give better performance? In both cases, the scan will have look at each row for a user, which presumably has mostly read notifications - which would have better performance. I think I'm leaning towards (3) - are there other options (or tweaks) that I have missed?