Below is a possible solution, based on your requirements:
- 50'000'000 writes per second
- low latency for data access
- queries always contain year and country, optionally month, day and city
year + country
Since year and country are always present in your queries, they should appear at the beginning of the row key, for example:
2018#de
2019#us
(I used # as a delimiter here, but it may be unnecessary if year and country always have fixed lengths of 4 and 2 characters respectively. You can remove it to save a byte!)
month + day + city
Because month, day and city are optional, they can also appear in the row key, but towards the end:
2018#de#0610#frankfurt
2019#de#0115#berlin
2019#us#0813#nyc
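As a minimal illustration (in Python, with a hypothetical base_row_key helper that is not part of any Bigtable client), the key could be assembled like this, zero-padding month and day to a fixed width so the keys stay lexicographically ordered:

```python
def base_row_key(year, country, month=None, day=None, city=None):
    """Hypothetical helper: mandatory fields first, optional ones at the end."""
    key = f"{year:04d}#{country}"
    if month is not None and day is not None:
        # Zero-pad so that "0610" sorts correctly against "1201".
        key += f"#{month:02d}{day:02d}"
    if city is not None:
        key += f"#{city}"
    return key

print(base_row_key(2019, "us", 8, 13, "nyc"))  # 2019#us#0813#nyc
```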
Feel free to reorder the elements as needed (if queries with year, country and city are the most common, then the order should be year#country#city). Only you know your most frequent queries, and it's always necessary to design your row key with your queries in mind.
Avoid hotspotting
But, as you suggested in your question, this row key design can lead to Bigtable node hotspotting (all writes going to a single node because the row keys are contiguous). To solve this and ensure an even distribution of row keys across your nodes, I suggest using bucketing.
For each write, you can generate a random number (between 0 and 7, for example, if you want 8 buckets) and prepend that bucket number to your row key. For example:
3#2018#de#0610#frankfurt
2#2019#de#0115#berlin
7#2019#us#0813#nyc
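A minimal sketch of that salting step, assuming 8 buckets and a hypothetical salted_row_key helper (not a client library call):

```python
import random

NUM_BUCKETS = 8  # e.g. roughly one bucket per Bigtable node

def salted_row_key(row_key):
    """Prepend a random bucket number (0..NUM_BUCKETS-1) to a row key."""
    bucket = random.randint(0, NUM_BUCKETS - 1)
    return f"{bucket}#{row_key}"

print(salted_row_key("2019#us#0813#nyc"))  # e.g. 7#2019#us#0813#nyc
```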
You'll then be sure that your keys will be correctly distributed across your Bigtable nodes when writing.
You can check how this is done in HBase (Bigtable's open-source equivalent) here: https://hbase.apache.org/book.html#schema.casestudies.log_timeseries.tslead.
Querying data
But because of this bucketing (or salting), you'll need to change the way you query your table. If you want all the data for 2019 in the US, you'll need to perform 8 scans (one per bucket):
- start key: 0#2019#us#, end key: 0#2019#us~
- start key: 1#2019#us#, end key: 1#2019#us~
- start key: 2#2019#us#, end key: 2#2019#us~
- start key: 3#2019#us#, end key: 3#2019#us~
- start key: 4#2019#us#, end key: 4#2019#us~
- start key: 5#2019#us#, end key: 5#2019#us~
- start key: 6#2019#us#, end key: 6#2019#us~
- start key: 7#2019#us#, end key: 7#2019#us~
(I used ~ at the end of each end key because ~ comes after all other printable characters in the ASCII table. For the first scan, for example, this ensures that all row keys beginning with 0#2019#us# are retrieved.)
These scans can be performed in parallel for maximum performance.
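Here is a rough Python sketch of those parallel scans, assuming a hypothetical scan_range helper that wraps whatever range-scan call your Bigtable client library provides:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_BUCKETS = 8

def scan_range(start_key, end_key):
    # Hypothetical helper: issue a Bigtable range scan between start_key and
    # end_key with your client library and return the matching rows.
    return []  # placeholder

def query_year_country(year, country):
    """Run one scan per bucket, in parallel, and merge the results."""
    ranges = [
        (f"{bucket}#{year}#{country}#", f"{bucket}#{year}#{country}~")
        for bucket in range(NUM_BUCKETS)
    ]
    with ThreadPoolExecutor(max_workers=NUM_BUCKETS) as pool:
        results = pool.map(lambda r: scan_range(*r), ranges)
    return [row for rows in results for row in rows]

print(query_year_country(2019, "us"))
```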
Scanning is the most performant way to query data in Bigtable. You could also use filters (like FuzzyRowFilter to match row keys against a particular pattern), but scanning will definitely give you better latency. You can also combine a scan with a filter (for example, to retrieve all data for 2019 in the US in nyc, a filter is necessary to keep only the rows with city = nyc).
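For illustration only (in practice you would push this down as a server-side row key filter in your client library), here is a client-side sketch of keeping only the nyc rows, assuming the keys follow the <bucket>#<year>#<country>#<month><day>#<city> layout used above:

```python
def filter_by_city(row_keys, city):
    """Keep only row keys whose last component matches the requested city."""
    return [key for key in row_keys if key.split("#")[-1] == city]

rows = ["7#2019#us#0813#nyc", "7#2019#us#0901#boston"]
print(filter_by_city(rows, "nyc"))  # ['7#2019#us#0813#nyc']
```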
Conclusion
So, based on these elements, I would design my key like:
<bucket_number>#<year>#<country>#<month><day>#<city>
and query my table using scans. Separators (# here) are unnecessary if all fields have a fixed length.
You could also consider variants without bucketing if you have enough distinct <country> values to spread the keys across the different nodes:
<year>#<country>#<month><day>#<city>
or :
<country>#<year>#<month><day>#<city>
In conclusion, designing Bigtable row keys is always a tradeoff. With bucketing, you avoid hotspotting, but querying the data becomes more complex. Still, given your requirements (many writes), this is what I would do.
You can also adjust the number of buckets depending on the number of nodes in your Bigtable cluster. If you have more than 8 nodes, I recommend creating more buckets. Ideally, 1 bucket = 1 node, but a node can easily handle multiple buckets.
I do suggest testing this key design against others and benchmarking them in real conditions (a PoC). You can use the Bigtable Key Visualizer to check the distribution of your keys across your cluster.
Comments:
SELECT * FROM my_table WHERE country = 'us' AND city = 'new_york' AND date = '2019-08-12' ORDER BY id DESC LIMIT 100
or:
SELECT * FROM my_table WHERE country = 'de' AND city = 'berlin' AND YEAR(date) = '2017' ORDER BY id DESC LIMIT 100
- Mmh1
2019.us.Nyc.{guid} as a row key will lead to Bigtable node hotspotting (consecutive writes on the same node), because the row keys will be contiguous, as @Mmh1 correctly understood and wrote in his question. Moreover, querying Bigtable data through BigQuery is probably not the solution if low-latency reads are required. - norbjd