2 votes

My table storage has approximately 1-2 million records, and I have a daily job that needs to retrieve all the records that do not have a property A and then do some further processing on them.

I expect there to be about 1-1.5 million records without property A. As I understand it, there are two approaches:

  1. Query all records, then filter the results afterwards
  2. Do a table scan

Currently, we use the first approach: query all records and filter in C#. However, the task runs in an Azure Function App, and the query to retrieve all the results sometimes takes over 10 minutes, which is the limit for Azure Functions.
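For concreteness, here is a minimal sketch of what that approach looks like with the Azure.Data.Tables SDK; the SDK choice, connection setup, table name and property name are my assumptions rather than the actual code:

```csharp
using System;
using System.Collections.Generic;
using Azure.Data.Tables;

// Connection string, table name, and property name are assumptions.
var connectionString = Environment.GetEnvironmentVariable("StorageConnectionString");
var client = new TableClient(connectionString, "MyTable");

var withoutPropertyA = new List<TableEntity>();

// Enumerate every entity in the table. Table Storage has no filter for
// "property does not exist", so the check has to happen client-side,
// after each page of entities has been downloaded.
await foreach (var entity in client.QueryAsync<TableEntity>(maxPerPage: 1000))
{
    if (!entity.ContainsKey("A"))
        withoutPropertyA.Add(entity);
}
```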

I'm trying to understand why retrieving 1 million records takes so long, and how to optimise the query. The existing design of the table is that the partition key and row key are identical and are a GUID, which leads me to believe that there is one entity per partition.

Looking at Microsoft docs, here are some key Table Storage limits (https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#azure-table-storage-scale-targets):

  • Maximum request rate per storage account: 20,000 transactions per second, which assumes a 1-KiB entity size
  • Target throughput for a single table partition (1 KiB-entities): Up to 2,000 entities per second.

My initial guess is that I should use another partition key that groups 2,000 entities per partition, so each partition achieves its target throughput of 2,000 entities per second. Would this mean that 2,000,000 records could in theory be returned in 1 second (1,000 partitions of 2,000 entities each, read in parallel at 2,000 entities per second per partition)?

Any thoughts or advice appreciated.

Why not chunk the request: keep getting the top 100 until the result is < 100? Then you can also do it in parallel processes. – Frank Nielsen
How fast do you need to process? What about using a timer-triggered Azure Function that runs every 2 minutes and queries for the top N entities without property A and sets it on them? – Thiago Custodio
The process that happens after all the results are retrieved is to export them as a CSV. From my understanding we can batch the records into multiple CSVs, but it's not desirable to have too many; I think 2-3 could be acceptable. My thought now is to use a durable function, and to see whether it is possible to fan out the table storage query across multiple activities and then fan in to process. But some logic may need to be put in place to prevent duplicate entities being queried? There's a time constraint, so I'm trying to find a "good-enough" solution with the least amount of work possible. – Aaron Zhong
Update: it's not an option to change the partition key, because that could affect parts of the system that depend on it. – Aaron Zhong
@FrankNielsen could you clarify how you would get the top 100 results in parallel without getting the same 100 results for each query? – Aaron Zhong

3 Answers

1 vote

I found this question after blogging on the very topic. I have a project where I am using the Azure Functions Consumption plan and have a big Azure Storage Table (3.5 million records).

Here's my blog post: https://www.joelverhagen.com/blog/2020/12/distributed-scan-of-azure-tables

I mention a couple of options in the post, but I think the fastest is distributing the "table scan" work into smaller work items that can each be completed comfortably within the 10-minute limit. There is an implementation linked from the blog post if you want to try it out. It will likely take some adapting to your Azure Function, but most of the clever part (finding the partition key ranges) is implemented and tested.
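To illustrate the idea (this is my sketch, not the blog post's implementation): each work item can be a bounded PartitionKey range query, so many items can run independently within the time limit. The filter and the property name below are assumptions:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class TableScan
{
    // One "work item" in a distributed table scan: read a half-open
    // PartitionKey range [lowerInclusive, upperExclusive) and keep only
    // the entities missing property "A" (property name assumed).
    public static async Task<List<TableEntity>> ScanRangeAsync(
        TableClient client, string lowerInclusive, string upperExclusive)
    {
        var filter =
            $"PartitionKey ge '{lowerInclusive}' and PartitionKey lt '{upperExclusive}'";
        var results = new List<TableEntity>();

        await foreach (var entity in client.QueryAsync<TableEntity>(filter))
        {
            if (!entity.ContainsKey("A"))
                results.Add(entity);
        }

        return results;
    }
}
```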

This looks to be essentially what user3603467 is suggesting in his answer.

0 votes

I see two approaches to retrieving 1+ million records in a batch process where the result must be saved to a single medium, like a file.

First) You identify/select all the primary ids/keys of the related data. Then you spawn parallel jobs, each with a chunk of these primary ids/keys, in which you read the actual data and process it. Each job then reports its result to the single output medium.
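A rough sketch of this first approach, assuming the keys have already been listed up front; the chunk size and the processChunkAsync delegate are placeholders of mine, not part of the answer:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class KeyChunkRunner
{
    // allKeys: the PartitionKey/RowKey pairs selected in step one.
    // Split them into fixed-size chunks and run one job per chunk in
    // parallel; each job reads the actual entities for its keys, processes
    // them, and reports its output to the single result medium.
    public static async Task RunAsync(
        IEnumerable<(string PartitionKey, string RowKey)> allKeys,
        Func<(string PartitionKey, string RowKey)[], Task> processChunkAsync,
        int chunkSize = 10_000) // chunk size is an assumption; tune as needed
    {
        var jobs = allKeys
            .Chunk(chunkSize)                        // .NET 6+ LINQ
            .Select(chunk => processChunkAsync(chunk));

        await Task.WhenAll(jobs);
    }
}
```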

Second) You identify/select (for update) the top n of the related data and mark it with a "being processed" state. Use concurrency locking here; that should prevent other workers from picking the same data up if this is done in parallel.

I would go for the first solution if possible, since it is the simplest and cleanest. The second solution works best if you can use "select for update"; I don't know whether that is supported on Azure Table Storage.
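For what it's worth, Azure Table Storage has no pessimistic "select for update", but ETag-based optimistic concurrency can approximate the "mark as being processed" step of the second approach: a worker whose ETag is stale gets a 412 and skips that entity. A hedged sketch, with the property and state names assumed:

```csharp
using System.Threading.Tasks;
using Azure;
using Azure.Data.Tables;

public static class ClaimHelper
{
    // Attempt to "claim" an entity before processing it. If another worker
    // modified the entity after we read it, the ETag no longer matches, the
    // service returns 412 Precondition Failed, and we skip the entity.
    public static async Task<bool> TryClaimAsync(TableClient client, TableEntity entity)
    {
        entity["ProcessingState"] = "InProgress"; // property/state names assumed

        try
        {
            await client.UpdateEntityAsync(entity, entity.ETag, TableUpdateMode.Merge);
            return true;
        }
        catch (RequestFailedException ex) when (ex.Status == 412)
        {
            return false; // another worker claimed it first
        }
    }
}
```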

0 votes

You'll need to parallelise the task. As you don't know the partition keys, run separate queries whose PK ranges start and end at consecutive letters of the alphabet: write a query where PK > A && PK < B, then > B && < C, and so on. Then join the results in memory. Super easy to do in a single function; in JS just use Promise.all([]).
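Since the question's code is in C#, here is a hedged sketch of the same idea using Task.WhenAll in place of Promise.all. Because the partition keys in this table are GUIDs (assumed lowercase hex), the ranges below split on leading hex digits rather than letters of the alphabet, and the property name is an assumption:

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Azure.Data.Tables;

public static class ParallelRangeScan
{
    public static async Task<List<TableEntity>> ScanAsync(TableClient client)
    {
        // Lowercase GUID partition keys start with 0-9 or a-f; "g" closes
        // the last half-open range. Each adjacent pair of characters below
        // becomes one independent range query.
        const string bounds = "0123456789abcdefg";

        var rangeQueries = Enumerable.Range(0, bounds.Length - 1).Select(async i =>
        {
            var filter =
                $"PartitionKey ge '{bounds[i]}' and PartitionKey lt '{bounds[i + 1]}'";
            var entities = new List<TableEntity>();

            await foreach (var entity in client.QueryAsync<TableEntity>(filter))
            {
                if (!entity.ContainsKey("A")) // client-side filter, property name assumed
                    entities.Add(entity);
            }

            return entities;
        });

        // Run the range queries concurrently, then join the results in memory.
        var perRange = await Task.WhenAll(rangeQueries);
        return perRange.SelectMany(list => list).ToList();
    }
}
```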