1
votes

I'm currently prototyping a solution for storing users' location history in an HBase table. (Assume there are hundreds of millions of users.) Each user's trail of locations is stored in an HBase table. This trail of locations is then used by a few offline data-analysis jobs.

Following are the 2 main data access patterns:

  1. I should be able to scan through all locations, or a subset of them (based on a time range), of a specific user from the stored location trail.

  2. For offline data analysis, I should be able to scan through all locations of all users within a time range.

Given the above requirements, I came up with the following row-key design:

<uid>_<timestamp>

where 'uid' represents the user id and 'timestamp' represents the time at which the location was detected and saved.

With this row-key design, access pattern #1 is straightforward: the scan request can use a start key and an end key formed by appending the given timestamps to a specific uid.
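As an illustration (a sketch, not actual HBase client code; the key layout, millisecond timestamps, and padding width are assumptions), the start and stop keys for such a scan could be built like this. Note the timestamp must be fixed-width (zero-padded) so that the byte-wise lexicographic order HBase uses matches chronological order:

```python
def row_key(uid, ts_millis):
    # Zero-pad the timestamp to a fixed width so lexicographic
    # byte order matches chronological order.
    return f"{uid}_{ts_millis:013d}".encode("utf-8")

def scan_range(uid, ts_from, ts_to):
    # Start key is inclusive, stop key exclusive, matching HBase
    # scan semantics; +1 makes ts_to itself inclusive.
    return row_key(uid, ts_from), row_key(uid, ts_to + 1)

start, stop = scan_range("user42", 1_600_000_000_000, 1_615_000_000_000)
```

These two byte strings would then be passed as the start and stop rows of the scan in whatever client you use.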

However, the tricky part is access pattern #2, which is where I'm seeking help from the HBase experts. Since I need to scan across all users, say for the last 6 months, I end up not using any keys in the scan operation, which means scanning the entire HBase table. That feels inefficient. Moreover, my data size is expected to grow quickly, with a write load of 2K/sec.

I had a look at OpenTSDB, which many people pointed to in open forums, but I'm not able to map that solution onto my data access patterns.

I'm looking for help in optimizing this schema to avoid the full table scan.

3
Do you have a limit on the timestamp? That is, will you keep only the timestamps from the past 6 months? - Udy
Yes. I may keep the data for the last one year. I'll be configuring the TTL that way. - Prashanth G N
For data access pattern #2 - should the time range be flexible, or is it fixed? - Udy
The time ranges are something like: the past one or few weeks, the past one or few months. - Prashanth G N

3 Answers

1
votes

Instead of storing each location point in a single row, you could store each location in its own column, with a one-year TTL. This is similar to how OpenTSDB does its bucketing of metrics: within a certain time window, each reading of a metric is stored in a separate column.

This schema would allow you to scan over all of your users and, inside your scanning job, manually filter out dates you don't care about. This is still a full table scan, but only over the set of your users, not the set of all your locations.

This schema also has the advantage of allowing a user's entire location history to be accessed with a single get or a small scan (see below).

The downside of this schema revolves around the size of your rows for each user. If each user has a few hundred or thousand data points, you should be ok. But, if each user has millions of locations, your row sizes could grow to the same size as your region. Since HBase never splits rows across regions, you would wind up with regions consisting of a single row, which is not optimal.

To fix this, you need to implement your own bucketing of check-in data for each user, as OpenTSDB does. Say each bucket is uid+weekOfTheYear+year. The bucket granularity depends heavily on how often users add location data. This creates multiple rows per user, and thus requires a scan over each bucket for a given user. To access data for a specific date range, just use the timestamp filtering built into Scanners.
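A sketch of what that bucketed key could look like (the separator, ISO week numbering, and padding width are illustrative assumptions; a real implementation would build the same bytes in the Java client):

```python
from datetime import datetime, timezone

def bucket_key(uid, ts_millis):
    # One row (bucket) per user per ISO week; locations within the
    # week live in that row as separate columns keyed by timestamp.
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    year, week, _ = dt.isocalendar()
    return f"{uid}_{year:04d}{week:02d}".encode("utf-8")

def column_qualifier(ts_millis):
    # Column qualifier inside the bucket: the zero-padded timestamp,
    # so columns sort chronologically.
    return f"{ts_millis:013d}".encode("utf-8")
```

Scanning one user's date range then means scanning the handful of weekly bucket rows that overlap the range.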

0
votes

One simple thing you can do is promote some of the time into the key - e.g. add a month prefix. In this case regular queries may need to issue multiple scans (assuming that in the common case you only want the latest records, this mostly won't be a problem), but the longer-running jobs will be capped by the months they scan.

By the way, if in regular use you want the latest records, you may want to store the dates from newest to oldest (maxlong - timestamp) so that queries on a time range are faster.
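A sketch of both ideas combined (the exact layout, the month-prefix format, and using a Long.MAX_VALUE-style constant are assumptions for illustration):

```python
from datetime import datetime, timezone

MAX_LONG = 2**63 - 1  # analogous to Java's Long.MAX_VALUE

def month_key(uid, ts_millis):
    # Month prefix first, so a long-running job can scan month by
    # month; reversed, zero-padded timestamp last, so the newest
    # records sort first within each month for that user.
    dt = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
    reversed_ts = MAX_LONG - ts_millis
    return f"{dt:%Y%m}_{uid}_{reversed_ts:019d}".encode("utf-8")
```

With this layout, a scan for the "past few months" is a handful of bounded per-month scans rather than a full table scan.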

0
votes

Personally, I'd avoid using time-based prefixes in your row keys.

Let me point you in another direction: can you afford data duplication?

If the answer is YES, just create another table with the minimum data required for your jobs and the TTL set to 6 months (and another one with a 3-month TTL if you need it), and write to all the tables at once (you can buffer writes to those tables as much as you want). Also, if your table has a few families, you could just add the short-lived families to the same table, but I'd rather have separate tables for that (personal preference).

If the answer is NO, you could still do a timestamp-based range scan to avoid reading as much data as possible. If (as you say) the table will have a 1-year TTL, you can afford to do it; it's not like having to do a full table scan over 30 years of data just to retrieve a few days.

BTW, I'd recommend including at least a 2-3 byte prefix based on the numeric user id (modulo, crc32, md5...) in order to get an even distribution among regions and to deal better with inactive (or very active) users. There's no way you can predict how active your users are going to be.
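For example (a sketch; the crc32-based salt, 256 buckets, and hex encoding are assumptions - any stable hash of the uid works):

```python
import zlib

def salted_key(uid, ts_millis, buckets=256):
    # A short, stable hash prefix of the uid spreads users evenly
    # across regions; buckets = number of distinct prefixes, which
    # bounds how many region splits the prefix can address.
    salt = zlib.crc32(uid.encode("utf-8")) % buckets
    return f"{salt:02x}_{uid}_{ts_millis:013d}".encode("utf-8")
```

Because the salt is derived from the uid, a scan for one user still targets a single prefix; only a scan over all users has to fan out across the bucket prefixes.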