2
votes

Bigtable row key scenario to avoid hotspotting?

A company needs you to create a schema in Google Bigtable that will allow for the historical analysis of the last 2 years of records. Each record that comes in is sent every 15 minutes, and contains a unique identifier of the device and a data record. The most common query is for all the data for a given device for a given day. Which schema should you use?

  • A. Rowkey: date#device_id, Column data: data_point
  • B. Rowkey: date, Column data: device_id, data_point
  • C. Rowkey: device_id, Column data: date, data_point
  • D. Rowkey: data_point, Column data: device_id, date
  • E. Rowkey: date#data_point, Column data: device_id

What would be the best option in above?

2
Is this a homework question? Please see How do I ask and answer homework questions? for more information. If so, please annotate it as such. What do you think it should be and why? Have you looked at the docs on Bigtable schema design? - Misha Brukman
Thank you. Yes, I am aware of the norms as per the Big table documentation. But this question's answer options were confusing. - Roshan Fernando
If you've read the schema design documentation, you should be able to eliminate some obviously bad choices from the list, leaving a few, and then you can decide which is the best from the remaining set, given the most frequent types of queries you want to run — for each schema and for each query (here there's only 1 such query), consider how you can implement such a query, and see which schema provides the most efficient implementation of such a query. Hope this helps. - Misha Brukman
Yes, it cannot be A, B and E due to it starting with Date, which could cause lexicographical sort filling each node. I am confused with C or D option here. Just deviceid could cause multiple versions of Column data for date and data point. Having data_point as row key, i am not sure if that could be really be queryable. Can you provide inputs on which out of C or D would be a best option. [I am aware a row key of deviceid#date would be best option here, but would like to know the best from the available options above] - Roshan Fernando
Bigtable has built-in cell versionining: every value in a column has an associated 64-bit timestamp, so you can write multiple data points, keyed by time, within a single row, for a single column. Does that help? - Misha Brukman

2 Answers

4
votes

According to the Bigtable schema documentation:

Rows are sorted lexicographically by row key.

This means that in order to avoid hotspotting, common queries should return row results that sequential.

Essentially, you want to be querying rows with a given date and device id. Google Cloud Bigtable allows you query rows by a certain row key prefix. Since the most common queries all the data for a given device and date, the device and date need to be part of the row prefix query, and must be the first two entries in a row key.

0
votes

You have 2 kind of solution. Big Table make a Lexigoraphy dictionary using the rowkeys enforcing the organization about

1 - Add before each rowkey(prefix) a letter to force the Big Table make a lexicography index spreading your rows across the alphabet letters and avoid a colision during i/o. This technique is called Salted Table.

Ex.

123 456 789 101112 131415

a123 a456 b789 b101112 c131415

2- You can use a MD5 Hash, avoiding repeat the prefix before hasing and this way garantee a variety of prefix and this way The Big Table spreads the rowkeys across the instance´s disk.