
We have a table in Azure Table Storage which currently holds about 50,000 items, since it was implemented only recently.

PartitionKey: a DateTime value stored as a string
RowKey: a numeric value stored as a string

We use TableQuery to generate the filter condition. The PartitionKey filter looks like: PartitionKey ge '201801240000000000' && PartitionKey lt '201806220000000000'

Unfortunately, we cannot use a RowKey filter because we want data between two dates.

Fetching around a month of data takes about 5 seconds, and fetching around 3 months takes even longer.

Though we have a caching strategy in place, fetching the data for the first time takes a long time, as does any fetch whose date filter has changed.

Any suggestions to improve performance would be appreciated.


2 Answers

4
votes

As far as I can see from your post, the biggest issue you have is that your query spans multiple partitions. This is not optimal for performance. Based on the list below, you're somewhere between a Partition Scan and a Table Scan, since you specify the partition key, but you query a range of them rather than a single one.

  • A Point Query is the most efficient lookup to use and is recommended to be used for high-volume lookups or lookups requiring lowest latency. Such a query can use the indexes to locate an individual entity very efficiently by specifying both the PartitionKey and RowKey values. For example: $filter=(PartitionKey eq 'Sales') and (RowKey eq '2')
  • Second best is a Range Query that uses the PartitionKey and filters on a range of RowKey values to return more than one entity. The PartitionKey value identifies a specific partition, and the RowKey values identify a subset of the entities in that partition. For example: $filter=PartitionKey eq 'Sales' and RowKey ge 'S' and RowKey lt 'T'
  • Third best is a Partition Scan that uses the PartitionKey and filters on another non-key property and that may return more than one entity. The PartitionKey value identifies a specific partition, and the property values select for a subset of the entities in that partition. For example: $filter=PartitionKey eq 'Sales' and LastName eq 'Smith'
  • A Table Scan does not include the PartitionKey and is very inefficient because it searches all of the partitions that make up your table in turn for any matching entities. It will perform a table scan regardless of whether or not your filter uses the RowKey. For example: $filter=LastName eq 'Jones'
  • Queries that return multiple entities return them sorted in PartitionKey and RowKey order. To avoid resorting the entities in the client, choose a RowKey that defines the most common sort order.
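For reference, the lookup types above boil down to different OData filter strings. A minimal sketch of how they are composed (the helper names are invented for illustration; the values match the examples above):

```python
# Hypothetical helpers showing the OData filter string behind each query type.
def point_query(pk: str, rk: str) -> str:
    # Most efficient: both keys pin down a single entity.
    return f"(PartitionKey eq '{pk}') and (RowKey eq '{rk}')"

def range_query(pk: str, rk_from: str, rk_to: str) -> str:
    # One partition, a contiguous slice of RowKeys.
    return f"PartitionKey eq '{pk}' and RowKey ge '{rk_from}' and RowKey lt '{rk_to}'"

def partition_scan(pk: str, prop: str, value: str) -> str:
    # One partition, but every entity in it must be examined.
    return f"PartitionKey eq '{pk}' and {prop} eq '{value}'"
```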

Source: Azure Storage Table Design Guide: Designing Scalable and Performant Tables

Another very useful article is this one: What PartitionKey and RowKey are for in Windows Azure Table Storage, especially the image there titled 'Partitioning'.

Based on the size and load of a partition, partitions are fanned out across machines. Whenever a partition gets a high load or grows in size, the Windows Azure storage management can kick in and move a partition to another machine.

Edit:
If there are multiple ways you would like to query your data, consider storing it in multiple ways. Especially since storage is cheap, storing data multiple times is not that bad; this way you optimize for reads. This is known as the Materialized View pattern, which can "help support efficient querying and data extraction, and improve application performance".

However, keep in mind that this is simple only for static data. If your data changes a lot, keeping all the copies in sync might become a hassle.
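As a sketch of the pattern (entity shape and names such as DeviceId are invented for illustration, not taken from the question): the same logical entity is written twice, keyed for two different query patterns.

```python
# Materialized View sketch: one logical record, two storage layouts.
def materialized_views(device_id: str, timestamp: str, reading: float) -> list[dict]:
    """Return two copies of one entity, each keyed for a different lookup."""
    data = {"DeviceId": device_id, "Reading": reading}
    return [
        # View 1: keyed by date, so date-range queries touch few partitions.
        {"PartitionKey": timestamp[:8], "RowKey": f"{timestamp}_{device_id}", **data},
        # View 2: keyed by device, so per-device history becomes an
        # efficient single-partition Range Query on RowKey.
        {"PartitionKey": device_id, "RowKey": timestamp, **data},
    ]
```

Both copies would be written on every insert; the reader then picks whichever view matches its query.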

1
vote

rickvdbosch's answer is spot on.

Here are some additional thoughts, assuming this is an application. One approach would be to read smaller PartitionKey ranges in parallel. For example, assuming the range being processed is June/2018, we would have:

  • Thread-1 => PartitionKey ge '20180601' && PartitionKey lt '20180605'
  • Thread-2 => PartitionKey ge '20180605' && PartitionKey lt '20180610'
  • Thread-3 => PartitionKey ge '20180610' && PartitionKey lt '20180615'
  • Thread-4 => PartitionKey ge '20180615' && PartitionKey lt '20180620'
  • Thread-5 => PartitionKey ge '20180620' && PartitionKey lt '20180625'
  • Thread-6 => PartitionKey ge '20180625' && PartitionKey lt '20180701'
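The splitting step above can be sketched like this (function name and the 5-day chunk size are illustrative assumptions; the 8-character key format follows this answer's examples rather than the question's 18-character keys):

```python
from datetime import date, timedelta

def partition_ranges(start: date, end: date, step_days: int = 5) -> list[str]:
    """Split [start, end) into a list of PartitionKey range filters."""
    filters = []
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=step_days), end)
        filters.append(
            f"PartitionKey ge '{cur:%Y%m%d}' and PartitionKey lt '{nxt:%Y%m%d}'"
        )
        cur = nxt
    return filters

# Each filter could then be issued on its own thread, e.g. with
# concurrent.futures.ThreadPoolExecutor, merging the partial results:
#
#   with ThreadPoolExecutor(max_workers=6) as pool:
#       parts = pool.map(run_query, partition_ranges(date(2018, 6, 1), date(2018, 7, 1)))
#
# where run_query is a hypothetical function executing one table query.
```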

Moreover, it is possible to be even more aggressive and read smaller partitions (e.g. one day at a time) in parallel, querying each PartitionKey directly instead of building range filters with the TableQuery constructs.

Note that neither approach described above handles a highly unbalanced partitioning strategy. For example, if 95% of the data for June/2018 sits in the range '20180605' to '20180610', or even in a single day, there may be little or no improvement in overall execution time compared to a serial read, especially given the overhead of parallelism (threads, memory allocation, synchronization, etc.).

Now, assuming this is a .NET application running on Windows and the approach described above fits your scenario, consider:

  • Increasing the max number of connections;
  • Disabling the Nagle algorithm;

Find below a snippet of the relevant application configuration changes. Please note that:

  • It is possible to define a specific address (e.g. https://stackoverflow.com) for maxconnection instead of using "*".
  • It is recommended to run performance tests to determine the appropriate maxconnection value before releasing to production.
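A typical app.config for these two settings looks something like the following (the maxconnection value of 100 is only a placeholder, to be tuned by the performance tests mentioned above):

```xml
<configuration>
  <system.net>
    <connectionManagement>
      <!-- "*" applies to all endpoints; a specific address can be used instead. -->
      <add address="*" maxconnection="100" />
    </connectionManagement>
    <settings>
      <!-- Disable the Nagle algorithm for small, latency-sensitive requests. -->
      <servicePointManager useNagleAlgorithm="false" />
    </settings>
  </system.net>
</configuration>
```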

Find more details about the connection management at https://docs.microsoft.com/en-us/dotnet/framework/configure-apps/file-schema/network/connectionmanagement-element-network-settings.