4
votes

The best practices for using Spanner strongly recommend avoiding a timestamp or other sequential identifier as the first part of a key, as this will create hotspots. One of the suggested workarounds when a time-based ordering is required is to prefix the key with a numerical shard derived from the individual key, to get an even distribution (as in this page).
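To make the shard-prefix idea concrete, here is a minimal sketch, assuming a CRC32 hash and a shard count of 16. The names `NUM_SHARDS`, `shard_for`, and `make_key` are hypothetical and not part of any Spanner API; they only illustrate how a shard number could be derived from the entry id and placed ahead of the timestamp.

```python
import zlib

NUM_SHARDS = 16  # assumption: picked by the schema designer, not a Spanner setting


def shard_for(entry_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map an entry id to a stable shard number in [0, num_shards)."""
    # CRC32 is deterministic across processes, unlike Python's built-in hash().
    return zlib.crc32(entry_id.encode("utf-8")) % num_shards


def make_key(entry_id: str, timestamp: int) -> tuple:
    """Build a hypothetical (shard, timestamp, entry_id) primary key tuple."""
    return (shard_for(entry_id), timestamp, entry_id)
```

The same entry id always maps to the same shard, so rows within one shard remain ordered by timestamp.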

As I understand it, Spanner will automatically create splits based on the key (e.g. in this case, the shard), and when performing a query that gets all rows after a certain timestamp, it may need to run the query on each of the individual splits and then merge the results.
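The query pattern I have in mind can be modelled in memory roughly as follows. This is only a sketch of my mental model, not of how Spanner actually executes queries: `rows_after` is a hypothetical helper, and each shard is assumed to hold its rows already sorted by timestamp.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

Row = Tuple[int, str]  # (timestamp, entry_id)


def rows_after(shards: Dict[int, List[Row]], cutoff: int) -> Iterator[Row]:
    """Filter every shard for rows newer than cutoff, then merge by timestamp."""
    per_shard = []
    for shard_rows in shards.values():
        # Assumption: each shard's rows are already sorted by timestamp.
        per_shard.append([row for row in shard_rows if row[0] > cutoff])
    # heapq.merge lazily combines the sorted per-shard results into one stream.
    return heapq.merge(*per_shard)
```

In this model the work grows with the number of shard lists that have to be scanned and merged, which is what prompts the question below.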

Finally, the question: is there a cost proportional to the number of unique shards, so that querying the table may cost more with 1024 shards than with 16? Or does it come down to splits, with Spanner breaking the keys out across splits only when needed?

As an extreme example, would there be a cost to using the individual entry id as the first part of the key, rather than a shard (other than the fact that one is a number and one is a string)? Doing so would create many more "shards", but again, the impact would depend on whether the relevant thing here is unique shards (key prefixes) or splits.

2 Answers

1
votes

There is no cost proportional to the number of unique shards. The reason for sharding is to distribute traffic evenly among splits, so use whatever number of shards is required to get this even distribution.

What do you mean by entry id? If it is unique and evenly distributed, it could be used as the primary key.

1
votes

The Schema Design topic discusses using logical shards to avoid hotspots. The topic says, "Note that the splits might not align with the logical shards."

Cloud Spanner creates splits as needed. There's some more information in the Schema and Data Model topic, under Database splits.