2 votes

My table storage has approximately 1-2 million records, and I have a daily job that needs to retrieve all the records that do not have a property A and then do some further processing on them.

I expect there to be about 1-1.5 million records without property A. As I understand it, there are two approaches:

  1. Query all records, then filter the results afterwards
  2. Do a table scan

Currently, we use the first approach: query all records and filter in C#. However, the task runs in an Azure Function App, and the query to retrieve all the results sometimes takes over 10 minutes, which is the limit for Azure Functions.
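For concreteness, here is a minimal sketch of what that approach looks like with the Azure.Data.Tables SDK; the SDK choice, connection setup, table name and property name are my assumptions rather than the actual code:

```csharp
using System;
using System.Collections.Generic;
using Azure.Data.Tables;

// Connection string, table name, and property name are assumptions.
var connectionString = Environment.GetEnvironmentVariable("StorageConnectionString");
var client = new TableClient(connectionString, "MyTable");

var withoutPropertyA = new List<TableEntity>();

// Enumerate every entity in the table. Table Storage has no filter for
// "property does not exist", so the check has to happen client-side,
// after each page of entities has been downloaded.
await foreach (var entity in client.QueryAsync<TableEntity>(maxPerPage: 1000))
{
    if (!entity.ContainsKey("A"))
        withoutPropertyA.Add(entity);
}
```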

I'm trying to understand why retrieving 1 million records takes so long, and how to optimise the query. The existing design of the table is that the partition key and row key are identical and are a GUID, which leads me to believe that there is one entity per partition.

Looking at Microsoft docs, here are some key Table Storage limits (https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#azure-table-storage-scale-targets):

  • Maximum request rate per storage account: 20,000 transactions per second, which assumes a 1-KiB entity size
  • Target throughput for a single table partition (1 KiB-entities): Up to 2,000 entities per second.

My initial guess is that I should use another partition key that groups 2,000 entities per partition, so each partition achieves its target throughput of 2,000 entities per second. Would this mean that 2,000,000 records could in theory be returned in 1 second (1,000 partitions of 2,000 entities each, read in parallel at 2,000 entities per second per partition)?

Any thoughts or advice appreciated.

Why not chunk the request: keep getting the top 100 until the result is < 100? Then you can also do it in parallel processes. – Frank Nielsen
How fast do you need to process? What about using a timer-triggered Azure Function that runs every 2 minutes and queries for the top N entities without property A and sets it on them? – Thiago Custodio
The process that happens after all the results are retrieved is to export them as a CSV. From my understanding we can batch the records into multiple CSVs, but it's not desirable to have too many; I think 2-3 could be acceptable. My thought now is to use a durable function, and to see whether it is possible to fan out the table storage query across multiple activities and then fan in to process. But some logic may need to be put in place to prevent duplicate entities being queried? There's a time constraint, so I'm trying to find a "good-enough" solution with the least amount of work possible. – Aaron Zhong
Update: it's not an option to change the partition key, because that could affect parts of the system that depend on it. – Aaron Zhong
@FrankNielsen could you clarify how you would get the top 100 results in parallel without getting the same 100 results for each query? – Aaron Zhong

3 Answers

1 vote

I found this question after blogging on the very topic. I have a project where I am using the Azure Functions Consumption plan and have a big Azure Storage Table (3.5 million records).

Here's my blog post: https://www.joelverhagen.com/blog/2020/12/distributed-scan-of-azure-tables

I mention a couple of options in the post, but I think the fastest is distributing the "table scan" work into smaller work items that can each be completed comfortably within the 10-minute limit. There is an implementation linked from the blog post if you want to try it out. It will likely take some adapting to your Azure Function, but most of the clever part (finding the partition key ranges) is implemented and tested.
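To illustrate the idea (this is my sketch, not the blog post's implementation): each work item can be a bounded PartitionKey range query, so many items can run independently within the time limit. The filter and the property name below are assumptions:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class TableScan
{
    // One "work item" in a distributed table scan: read a half-open
    // PartitionKey range [lowerInclusive, upperExclusive) and keep only
    // the entities missing property "A" (property name assumed).
    public static async Task<List<TableEntity>> ScanRangeAsync(
        TableClient client, string lowerInclusive, string upperExclusive)
    {
        var filter =
            $"PartitionKey ge '{lowerInclusive}' and PartitionKey lt '{upperExclusive}'";
        var results = new List<TableEntity>();

        await foreach (var entity in client.QueryAsync<TableEntity>(filter))
        {
            if (!entity.ContainsKey("A"))
                results.Add(entity);
        }

        return results;
    }
}
```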

This looks to be essentially what user3603467 is suggesting in his answer.

0 votes

I see two approaches to retrieving 1+ million records in a batch process where the result must be saved to a single medium, like a file.

First) You identify/select all the primary ids/keys of the related data. Then you spawn parallel jobs, each with a chunk of these primary ids/keys, in which you read the actual data and process it. Each job then reports its result to the single output medium.
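A rough sketch of this first approach, assuming the keys have already been listed up front; the chunk size and the processChunkAsync delegate are placeholders of mine, not part of the answer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class KeyChunkRunner
{
    // allKeys: the PartitionKey/RowKey pairs selected in step one.
    // Split them into fixed-size chunks and run one job per chunk in
    // parallel; each job reads the actual entities for its keys, processes
    // them, and reports its output to the single result medium.
    public static async Task RunAsync(
        IEnumerable<(string PartitionKey, string RowKey)> allKeys,
        Func<(string PartitionKey, string RowKey)[], Task> processChunkAsync,
        int chunkSize = 10_000) // chunk size is an assumption; tune as needed
    {
        var jobs = allKeys
            .Chunk(chunkSize)                        // .NET 6+ LINQ
            .Select(chunk => processChunkAsync(chunk));

        await Task.WhenAll(jobs);
    }
}
```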

Second) You identify/select (for update) the top n of the related data and mark it with a "being processed" state. Use concurrency locking here; that should prevent other workers from picking the same data up if this is done in parallel.

I would go for the first solution if possible, since it is the simplest and cleanest. The second solution works best if you can use "select for update"; I don't know whether that is supported on Azure Table Storage.
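For what it's worth, Azure Table Storage has no pessimistic "select for update", but ETag-based optimistic concurrency can approximate the "mark as being processed" step of the second approach: a worker whose ETag is stale gets a 412 and skips that entity. A hedged sketch, with the property and state names assumed:

```csharp
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;

public static class ClaimHelper
{
    // Attempt to "claim" an entity before processing it. If another worker
    // modified the entity after we read it, the ETag no longer matches, the
    // service returns 412 Precondition Failed, and we skip the entity.
    public static async Task<bool> TryClaimAsync(TableClient client, TableEntity entity)
    {
        entity["ProcessingState"] = "InProgress"; // property/state names assumed

        try
        {
            await client.UpdateEntityAsync(entity, entity.ETag, TableUpdateMode.Merge);
            return true;
        }
        catch (RequestFailedException ex) when (ex.Status == 412)
        {
            return false; // another worker claimed it first
        }
    }
}
```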

0 votes

You'll need to parallelise the task. As you don't know the partition keys, run separate queries whose PK ranges start and end at consecutive letters of the alphabet: write a query where PK > A && PK < B, then > B && < C, and so on. Then join the results in memory. Super easy to do in a single function; in JS just use Promise.all([]).
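Since the question's code is in C#, here is a hedged sketch of the same idea using Task.WhenAll in place of Promise.all. Because the partition keys in this table are GUIDs (assumed lowercase hex), the ranges below split on leading hex digits rather than letters of the alphabet, and the property name is an assumption:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class ParallelRangeScan
{
    public static async Task<List<TableEntity>> ScanAsync(TableClient client)
    {
        // Lowercase GUID partition keys start with 0-9 or a-f; "g" closes
        // the last half-open range. Each adjacent pair of characters below
        // becomes one independent range query.
        const string bounds = "0123456789abcdefg";

        var rangeQueries = Enumerable.Range(0, bounds.Length - 1).Select(async i =>
        {
            var filter =
                $"PartitionKey ge '{bounds[i]}' and PartitionKey lt '{bounds[i + 1]}'";
            var entities = new List<TableEntity>();

            await foreach (var entity in client.QueryAsync<TableEntity>(filter))
            {
                if (!entity.ContainsKey("A")) // client-side filter, property name assumed
                    entities.Add(entity);
            }

            return entities;
        });

        // Run the range queries concurrently, then join the results in memory.
        var perRange = await Task.WhenAll(rangeQueries);
        return perRange.SelectMany(list => list).ToList();
    }
}
```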