Below is a possible solution, based on your requirements:
- 50'000'000 writes per second
- low latency for data access
- queries always contain year and country, optionally month, day and city
year + country
Since year and country are always present in your queries, they should appear at the beginning of the row key, for example:
2018#de
2019#us
(I used # as a delimiter here, but it may be unnecessary if year and country always have fixed lengths of 4 and 2 characters respectively. You can remove it to save a byte!)
month + day + city
Because month, day and city are optional, they can also appear in the row key, but towards the end:
2018#de#0610#frankfurt
2019#de#0115#berlin
2019#us#0813#nyc
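As a minimal illustration (in Python, with a hypothetical base_row_key helper that is not part of any Bigtable client), the key could be assembled like this, zero-padding month and day to a fixed width so the keys stay lexicographically ordered:

```python
def base_row_key(year, country, month=None, day=None, city=None):
    """Hypothetical helper: mandatory fields first, optional ones at the end."""
    key = f"{year:04d}#{country}"
    if month is not None and day is not None:
        # Zero-pad so that "0610" sorts correctly against "1201".
        key += f"#{month:02d}{day:02d}"
    if city is not None:
        key += f"#{city}"
    return key

print(base_row_key(2019, "us", 8, 13, "nyc"))  # 2019#us#0813#nyc
```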
Feel free to reorder the elements as needed (if queries with year, country and city are the most common, then the order should be year#country#city). Only you know your most frequent queries, and it's always necessary to design your row key with your queries in mind.
Avoid hotspotting
But, as you suggested in your question, this row key design can lead to Bigtable node hotspotting (all writes going to a single node because the row keys are contiguous). To solve this and ensure an even distribution of row keys across your nodes, I suggest using bucketing.
For each write, you can generate a random number (between 0 and 7, for example, if you want 8 buckets) and prepend that bucket number to your row key. For example:
3#2018#de#0610#frankfurt
2#2019#de#0115#berlin
7#2019#us#0813#nyc
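A minimal sketch of that salting step, assuming 8 buckets and a hypothetical salted_row_key helper (not a client library call):

```python
import random

NUM_BUCKETS = 8  # e.g. roughly one bucket per Bigtable node

def salted_row_key(row_key):
    """Prepend a random bucket number (0..NUM_BUCKETS-1) to a row key."""
    bucket = random.randint(0, NUM_BUCKETS - 1)
    return f"{bucket}#{row_key}"

print(salted_row_key("2019#us#0813#nyc"))  # e.g. 7#2019#us#0813#nyc
```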
You'll then be sure that your keys will be correctly distributed across your Bigtable nodes when writing.
You can check how this is done in HBase (Bigtable's open-source equivalent) here: https://hbase.apache.org/book.html#schema.casestudies.log_timeseries.tslead.
Querying data
But because of this bucketing (or salting), you'll need to change the way you query your table. If you want all the data for 2019 in the US, you'll need to perform 8 scans (one per bucket):
- start key: 0#2019#us#, end key: 0#2019#us~
- start key: 1#2019#us#, end key: 1#2019#us~
- start key: 2#2019#us#, end key: 2#2019#us~
- start key: 3#2019#us#, end key: 3#2019#us~
- start key: 4#2019#us#, end key: 4#2019#us~
- start key: 5#2019#us#, end key: 5#2019#us~
- start key: 6#2019#us#, end key: 6#2019#us~
- start key: 7#2019#us#, end key: 7#2019#us~
(I used ~ at the end of each end key because ~ comes after all other printable characters in the ASCII table. For the first scan, for example, this ensures that all row keys beginning with 0#2019#us# are retrieved.)
These scans can be performed in parallel for maximum performance.
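Here is a rough Python sketch of those parallel scans, assuming a hypothetical scan_range helper that wraps whatever range-scan call your Bigtable client library provides:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 8

def scan_range(start_key, end_key):
    # Hypothetical helper: issue a Bigtable range scan between start_key and
    # end_key with your client library and return the matching rows.
    return []  # placeholder

def query_year_country(year, country):
    """Run one scan per bucket, in parallel, and merge the results."""
    ranges = [
        (f"{bucket}#{year}#{country}#", f"{bucket}#{year}#{country}~")
        for bucket in range(NUM_BUCKETS)
    ]
    with ThreadPoolExecutor(max_workers=NUM_BUCKETS) as pool:
        results = pool.map(lambda r: scan_range(*r), ranges)
    return [row for rows in results for row in rows]

print(query_year_country(2019, "us"))
```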
Scanning is the most performant way to query data in Bigtable. You could also use filters (like FuzzyRowFilter to match row keys against a particular pattern), but scanning will definitely give you better latency. You can also combine a scan with a filter (for example, to retrieve all data for 2019 in the US in nyc, a filter is necessary to keep only the rows with city = nyc).
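For illustration only (in practice you would push this down as a server-side row key filter in your client library), here is a client-side sketch of keeping only the nyc rows, assuming the keys follow the <bucket>#<year>#<country>#<month><day>#<city> layout used above:

```python
def filter_by_city(row_keys, city):
    """Keep only row keys whose last component matches the requested city."""
    return [key for key in row_keys if key.split("#")[-1] == city]

rows = ["7#2019#us#0813#nyc", "7#2019#us#0901#boston"]
print(filter_by_city(rows, "nyc"))  # ['7#2019#us#0813#nyc']
```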
Conclusion
So, based on these elements, I would design my key like:
<bucket_number>#<year>#<country>#<month><day>#<city>
and query my table using scans. Separators (# here) are unnecessary if all fields have a fixed length.
You could also consider variants without bucketing if you have enough distinct <country> values to spread the keys across the different nodes:
<year>#<country>#<month><day>#<city>
or :
<country>#<year>#<month><day>#<city>
In conclusion, designing Bigtable row keys is always a tradeoff. With bucketing, you avoid hotspotting, but querying the data becomes more complex. Still, given your requirements (many writes), this is what I would do.
You can also adjust the number of buckets depending on the number of nodes in your Bigtable cluster. If you have more than 8 nodes, I recommend creating more buckets. Ideally, 1 bucket = 1 node, but a node can easily handle multiple buckets.
I do suggest testing this key design against others and benchmarking them in real conditions (a PoC). You can use the Bigtable Key Visualizer to check the distribution of your keys across your cluster.
Comments:
SELECT * FROM my_table WHERE country = 'us' AND city = 'new_york' AND date = '2019-08-12' ORDER BY id DESC LIMIT 100
or:
SELECT * FROM my_table WHERE country = 'de' AND city = 'berlin' AND YEAR(date) = '2017' ORDER BY id DESC LIMIT 100
- Mmh1
2019.us.Nyc.{guid} as a row key will lead to Bigtable node hotspotting (consecutive writes on the same node), because the row keys will be contiguous, as @Mmh1 correctly understood and wrote in his question. Moreover, querying Bigtable data through BigQuery is probably not the solution if low-latency reads are required. - norbjd