We’re using CosmosDB in production to store HTTP request/response audit data. The structure of this data generally looks as follows:
{
"id": "5ff4c51d3a7a47c0b5697520ae024769",
"Timestamp": "2019-06-27T10:08:03.2123924+00:00",
"Source": "Microservice",
"Origin": "Client",
"User": "SOME-USER",
"Uri": "GET /some/url",
"NormalizedUri": "GET /SOME/URL",
"UserAgent": "okhttp/3.10.0",
"Client": "0.XX.0-ssffgg;8.1.0;samsung;SM-G390F",
"ClientAppVersion": "XX-ssffgg",
"ClientAndroidVersion": "8.1.0",
"ClientManufacturer": "samsung",
"ClientModel": "SM-G390F",
"ResponseCode": "OK",
"TrackingId": "739f22d01987470591556468213651e9",
"Response": "[ REDACTED ], <— Usually quite long (thousands of chars)
"PartitionKey": 45,
"InstanceVersion": 1,
"_rid": "TIFzALOuulIEAAAAAACACA==",
"_self": "dbs/TIFzAA==/colls/TIFzALOuulI=/docs/TIFzALOuulIEAAAAAACACA==/",
"_etag": "\"0d00c779-0000-0d00-0000-5d1495830000\"",
"_attachments": "attachments/",
"_ts": 1561630083
}
We’re currently writing around 150,000 - 200,000 of documents similar to the above a day with /PartitionKey
as the partition key path that's configured on the container. The value of the PartitionKey is a randomly generated number in C#.net between 0 and 999.
However, we are seeing daily hotspots where a single physical partition can hit a max of 2.5K - 4.5K RU/s and others are very low (around 200 RU/s). This has a knock on to cost implications as we need to provision throughput for our largest utilised partition.
The second factor is we're storing a fair bit of data, close to 1TB of documents, and we add a few GB each day. As a result we have currently have around 40 physical partitions.
Combining these two factors means we end up having to provision for at minimum somewhere between 120,000 - 184,000 RU/s.
I should mention that we barely ever need to query this data; apart from very occasional for ad-hoc manually constructed queries in Cosmos data explorer.
My question is... would we be a lot better off in terms of RU/s required and distribution of data by simply using the “id” column as our partition key (or a randomly generated GUID) - and then setting a sensible TTL so we don't have a continually growing dataset?
I understand this would require us to re-create the collection.
Thanks very much.