We have recently updated our logging to use Azure Table Storage, which, owing to its low cost and high performance when querying by partition and row key, is well suited to this purpose.
We are trying to follow the guidelines given in the document Designing a Scalable Partitioning Strategy for Azure Table Storage. As we are making a great number of inserts to this table (and hopefully an increasing number as we scale), we need to ensure that we don't hit the throughput limits, which would result in lost logs. We structured our design as follows:
We have an Azure storage account per environment (DEV, TEST, PROD).
We have a table per product.
We are using TicksReversed+GUID for the Row Key, so that we can query blocks of results between certain times with high performance.
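For concreteness, here is a minimal Python sketch of how such a row key can be generated (the tick constants mirror .NET's DateTime; the underscore separator and zero-padding are our own conventions):

```python
import uuid
from datetime import datetime, timezone

# .NET-style ticks: 100 ns intervals since 0001-01-01.
MAX_TICKS = 3_155_378_975_999_999_999    # DateTime.MaxValue.Ticks
EPOCH_TICKS = 621_355_968_000_000_000    # ticks at the Unix epoch (1970-01-01)

def ticks_reversed_row_key(now: datetime | None = None) -> str:
    """Row key that sorts lexicographically from newest to oldest."""
    now = now or datetime.now(timezone.utc)
    ticks = EPOCH_TICKS + int(now.timestamp() * 10_000_000)
    # Subtracting from MaxValue reverses the sort order; zero-padding keeps
    # lexicographic order consistent with numeric order. The GUID suffix
    # guarantees uniqueness when two entities share the same tick.
    return f"{MAX_TICKS - ticks:019d}_{uuid.uuid4()}"
```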
We originally chose to partition the table by Logger, which for us meant broad areas of the product such as API, Application, Performance and Caching. However, given the small number of partitions, we were concerned that this would create so-called "hot" partitions, where many inserts are performed on one partition in a given time period. So we changed to partitioning on Context (for us, the class name or API resource).
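As an illustration of this scheme, here is a sketch of a single insert using the azure-data-tables Python SDK; the connection string, table name, and log properties are placeholders, and ticks_reversed_row_key is the helper sketched above:

```python
from azure.data.tables import TableClient

# Placeholder connection string for the per-environment account; one table
# per product, per our design.
table = TableClient.from_connection_string(conn_str, table_name="ProductLogs")

table.create_entity(entity={
    "PartitionKey": "OrdersController",     # Context: class name or API resource
    "RowKey": ticks_reversed_row_key(),     # newest-first ordering within a partition
    "Logger": "API",
    "Message": "GET /orders/42 completed in 18 ms",
})
```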
However, in practice we have found this is less than ideal, because when we look at our logs at a glance we would like them to appear in time order. Instead we end up with blocks of results grouped by Context, and we would have to query every partition and merge the results to order them by time.
Some ideas we had were (sketched in the code after this list):
use blocks of time (say 1 hour) as partition keys, so entities are ordered by time (this results in a hot partition for an hour at a time);
use a small set of random GUIDs as partition keys to try to distribute the logs (we lose the ability to query quickly on properties such as Context).
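Sketched concretely, the two ideas might look like this (bucket size and pool size are arbitrary choices for illustration):

```python
import random
import uuid
from datetime import datetime, timezone

def hourly_partition_key(now: datetime | None = None) -> str:
    # Idea 1: one partition per hour, e.g. "2016031114". Partitions sort by
    # time, but every write during that hour hits a single partition.
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y%m%d%H")

# Idea 2: a fixed pool of random partition keys spreads the write load,
# but time-ordered or Context-based reads must fan out to every partition.
PARTITION_POOL = [str(uuid.uuid4()) for _ in range(8)]

def random_partition_key() -> str:
    return random.choice(PARTITION_POOL)
```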
As this is such a common application of Azure Table Storage, there must be some sort of standard procedure. What is the best practice for partitioning Azure tables that are used for storing logs?
Solution constraints
Use cheap Azure storage (Table Storage seems the obvious choice)
Fast, scalable writes
Low chance of lost logs (i.e. from exceeding the Table Storage write limit of 2,000 entities per second per partition).
Reading ordered by date, most recent first.
If possible, to partition on something that would be useful to query (such as product area).