I would like to retain the full history of database writes on a specific table. The main usage of this table is to read the latest data, but full auditability of all inserts and updates is also a business requirement. The official Spanner documentation here calls out schema anti-patterns, one of which is using monotonically increasing data as a primary key. It notes that changing the order of the primary key columns can spread the load, and also suggests hashing, sharding with a modulo, UUIDs, etc.
This Google Cloud blog post mentions that using a ShardId instead of a timestamp hash is preferred:
Note, however, that using a simple hash will make queries by timestamp range extremely slow, since retrieving a range of timestamps will require a full table scan to cover all the hashes. Instead, we recommend generating a ShardId from the timestamp.
The example table setup is provided, along with a query using the TimestampShardId:
TimestampShardId = CRC32(Timestamp) % 100
CREATE TABLE Events (
TimestampShardId INT64 NOT NULL,
Timestamp TIMESTAMP NOT NULL,
event_info...
) PRIMARY KEY (TimestampShardId, Timestamp DESC)
SELECT * FROM Events
WHERE
TimestampShardId BETWEEN 0 AND 99
AND Timestamp > @lower_bound
AND Timestamp < @upper_bound;
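For reference, this is roughly how I expect my main access pattern, reading the latest data, to look against that schema. This is my own sketch, not from the blog post; the one-hour window and the row limit are arbitrary placeholders.

SELECT *
FROM Events
WHERE
  -- fan out over all 100 shards, then keep only the recent window
  TimestampShardId BETWEEN 0 AND 99
  AND Timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
ORDER BY Timestamp DESC
LIMIT 100;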
I'm failing to understand how this TimestampShardId makes the scan faster than simple hashing. Both approaches seem to need to scan the entire table, so could anyone walk me through why the ShardId is preferred? For example, in order to get the full history, would a history table with a hash of the timestamp as the primary key cause issues? What about a primary key of a UUID plus the timestamp?
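To make those alternatives concrete, these are the kinds of schemas I have in mind. They are my own sketches; the table names are placeholders, EventInfo stands in for the real event columns, and STRING(36) is just one way to store a UUID.

-- Alternative 1: history table keyed by a plain hash of the timestamp
CREATE TABLE EventsHistoryByHash (
  TimestampHash INT64 NOT NULL,   -- e.g. CRC32(Timestamp), not reduced to a small shard count
  Timestamp TIMESTAMP NOT NULL,
  EventInfo STRING(MAX)
) PRIMARY KEY (TimestampHash, Timestamp DESC);

-- Alternative 2: history table keyed by a client-generated UUID plus the timestamp
CREATE TABLE EventsHistoryByUuid (
  EventId STRING(36) NOT NULL,    -- UUIDv4 generated by the writer
  Timestamp TIMESTAMP NOT NULL,
  EventInfo STRING(MAX)
) PRIMARY KEY (EventId, Timestamp DESC);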