1
votes

I'm currently prototyping a solution for storing users' location history in an HBase table. (Assume there are hundreds of millions of users.) Each user's trail of locations is stored in an HBase table. This trail of locations is then used by a few offline data-analysis jobs.

Following are the 2 main data access patterns:

  1. I should be able to scan through all locations, or a subset of them (based on a time range), of a specific user from the stored location trail.

  2. For offline data analysis, I should be able to scan through all locations of all users within a time range.

Given the above requirements, I came up with the following row-key design:

<uid>_<timestamp>

where 'uid' represents the user id and 'timestamp' represents the time at which the location was detected and saved.

With this row-key design, access pattern #1 is straightforward: the scan request can use a start key and an end key formed by appending the given timestamps to a specific uid.
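As an illustration (a sketch, not actual HBase client code; the key layout, millisecond timestamps, and padding width are assumptions), the start and stop keys for such a scan could be built like this. Note the timestamp must be fixed-width (zero-padded) so that the byte-wise lexicographic order HBase uses matches chronological order:

```python
def row_key(uid, ts_millis):
    # Zero-pad the timestamp to a fixed width so lexicographic
    # byte order matches chronological order.
    return f"{uid}_{ts_millis:013d}".encode("utf-8")

def scan_range(uid, ts_from, ts_to):
    # Start key is inclusive, stop key exclusive, matching HBase
    # scan semantics; +1 makes ts_to itself inclusive.
    return row_key(uid, ts_from), row_key(uid, ts_to + 1)

start, stop = scan_range("user42", 1_600_000_000_000, 1_615_000_000_000)
```

These two byte strings would then be passed as the start and stop rows of the scan in whatever client you use.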

However, the tricky part is access pattern #2, which is where I'm seeking help from the HBase experts. Since I need to scan across all users, say for the last 6 months, I end up not using any keys in the scan operation, which means scanning the entire HBase table. That feels inefficient. Moreover, my data size is expected to grow quickly, with a write load of 2K/sec.

I had a look at OpenTSDB, which many people pointed to in open forums, but I'm not able to map that solution onto my data access patterns.

I'm looking for help in optimizing this schema to avoid the full table scan.

3
Do you have a limit on the timestamp? That is, will you keep only the timestamps from the past 6 months? - Udy
Yes. I may keep the data for the last one year. I'll be configuring the TTL that way. - Prashanth G N
For data access pattern #2 - should the time range be flexible, or is it fixed? - Udy
The time ranges are something like: the past one or few weeks, the past one or few months. - Prashanth G N

3 Answers

1
votes

Instead of storing each location point in a single row, you could store each location in its own column, with a one-year TTL. This is similar to how OpenTSDB does its bucketing of metrics: within a certain time window, each reading of a metric is stored in a separate column.

This schema would allow you to scan over all of your users and, inside your scanning job, manually filter out dates you don't care about. This is still a full table scan, but only over the set of your users, not the set of all your locations.

This schema also has the advantage of allowing a user's entire location history to be accessed with a single get or a small scan (see below).

The downside of this schema revolves around the size of your rows for each user. If each user has a few hundred or thousand data points, you should be ok. But, if each user has millions of locations, your row sizes could grow to the same size as your region. Since HBase never splits rows across regions, you would wind up with regions consisting of a single row, which is not optimal.

To fix this, you need to implement your own bucketing of check-in data for each user, as OpenTSDB does. Say each bucket is uid+weekOfTheYear+year. The bucket granularity depends heavily on how often users add location data. This creates multiple rows per user, and thus requires a scan over each bucket for a given user. To access data for a specific date range, just use the timestamp filtering built into Scanners.
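A sketch of what that bucketed key could look like (the separator, ISO week numbering, and padding width are illustrative assumptions; a real implementation would build the same bytes in the Java client):

```python
from datetime import datetime, timezone

def bucket_key(uid, ts_millis):
    # One row (bucket) per user per ISO week; locations within the
    # week live in that row as separate columns keyed by timestamp.
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    year, week, _ = dt.isocalendar()
    return f"{uid}_{year:04d}{week:02d}".encode("utf-8")

def column_qualifier(ts_millis):
    # Column qualifier inside the bucket: the zero-padded timestamp,
    # so columns sort chronologically.
    return f"{ts_millis:013d}".encode("utf-8")
```

Scanning one user's date range then means scanning the handful of weekly bucket rows that overlap the range.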

0
votes

One simple thing you can do is promote some of the time into the key - e.g. add a month prefix. In this case regular queries may need to issue multiple scans (assuming that in the common case you only want the latest records, this mostly won't be a problem), but the longer-running jobs will be capped by the months they scan.

By the way, if in regular use you want the latest records, you may want to store the dates from newest to oldest (maxlong - timestamp) so that queries on a time range are faster.
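A sketch of both ideas combined (the exact layout, the month-prefix format, and using a Long.MAX_VALUE-style constant are assumptions for illustration):

```python
from datetime import datetime, timezone

MAX_LONG = 2**63 - 1  # analogous to Java's Long.MAX_VALUE

def month_key(uid, ts_millis):
    # Month prefix first, so a long-running job can scan month by
    # month; reversed, zero-padded timestamp last, so the newest
    # records sort first within each month for that user.
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    reversed_ts = MAX_LONG - ts_millis
    return f"{dt:%Y%m}_{uid}_{reversed_ts:019d}".encode("utf-8")
```

With this layout, a scan for the "past few months" is a handful of bounded per-month scans rather than a full table scan.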

0
votes

Personally, I'd avoid using time-based prefixes in your row keys.

Let me point you in another direction: can you afford data duplication?

If the answer is YES, just create another table with the minimum data required for your jobs and the TTL set to 6 months (and another one with a 3-month TTL if you need it), and write to all the tables at once (you can buffer writes to those tables as much as you want). Also, if your table has a few families, you could just add the short-lived families to the same table, but I'd rather have separate tables for that (personal preference).

If the answer is NO, you could still do a timestamp-based range scan to avoid reading as much data as possible. If (as you say) the table will have a 1-year TTL, you can afford to do it; it's not like having to do a full table scan over 30 years of data just to retrieve a few days.

BTW, I'd recommend including at least a 2-3 byte prefix based on the numeric user id (modulo, crc32, md5...) in order to get an even distribution among regions and to deal better with inactive (or very active) users. There's no way you can predict how active your users are going to be.
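For example (a sketch; the crc32-based salt, 256 buckets, and hex encoding are assumptions - any stable hash of the uid works):

```python
import zlib

def salted_key(uid, ts_millis, buckets=256):
    # A short, stable hash prefix of the uid spreads users evenly
    # across regions; buckets = number of distinct prefixes, which
    # bounds how many region splits the prefix can address.
    salt = zlib.crc32(uid.encode("utf-8")) % buckets
    return f"{salt:02x}_{uid}_{ts_millis:013d}".encode("utf-8")
```

Because the salt is derived from the uid, a scan for one user still targets a single prefix; only a scan over all users has to fan out across the bucket prefixes.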