My table storage has approximately 1-2 million records, and I have a daily job that needs to retrieve all the records that do not have a property A and do some further processing on them.
About 1-1.5 million records are expected to be without property A. As I understand it, there are two approaches:
- Query all records then filter results after
- Do a table scan
Currently, the job uses the first approach: it queries all records and filters them in C#. However, the task runs in an Azure Function App, and the query to retrieve all the results sometimes takes over 10 minutes, which is the limit for Azure Functions.
I'm trying to understand why retrieving 1 million records takes so long and how to optimise the query. In the existing table design, the partition key and row key are identical and are both a GUID, which leads me to believe that there is one entity per partition.
According to the Microsoft docs, these are the key Table Storage limits (https://docs.microsoft.com/en-us/azure/storage/common/storage-scalability-targets#azure-table-storage-scale-targets):
- Maximum request rate per storage account: 20,000 transactions per second, which assumes a 1-KiB entity size
- Target throughput for a single table partition (1 KiB-entities): Up to 2,000 entities per second.
My initial guess is that I should use another partition key to group 2,000 entities per partition to achieve the target throughput of 2,000 per second per partition. Would this mean that 2,000,000 records could in theory be returned in 1 second?
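To sanity-check that arithmetic, here is a rough back-of-the-envelope sketch in Python. It is only an estimate under assumptions of mine: the function name `scan_time_bounds` is made up for illustration, every partition is assumed to be read fully in parallel, and the account-level figure of 20,000 transactions per second is treated as 20,000 entities per second (one entity per transaction), which may be pessimistic because a single query request can return many entities.

```python
import math


def scan_time_bounds(total_entities, entities_per_partition,
                     partition_rate=2_000, account_rate=20_000):
    """Rough lower bounds on read time from the two quoted scale targets.

    partition_rate: entities/second for one partition (quoted target).
    account_rate:   ASSUMPTION: the account's 20,000 transactions/second
                    is treated as 20,000 entities/second.
    """
    # Number of partitions needed to hold all entities.
    partitions = math.ceil(total_entities / entities_per_partition)
    # Time to drain one partition at the per-partition target,
    # assuming all partitions are read in parallel.
    per_partition_seconds = entities_per_partition / partition_rate
    # Time if the account-level figure caps the whole read.
    account_seconds = total_entities / account_rate
    return partitions, per_partition_seconds, account_seconds


partitions, per_partition_seconds, account_seconds = scan_time_bounds(
    total_entities=2_000_000, entities_per_partition=2_000)
print(partitions, per_partition_seconds, account_seconds)
```

Under these assumptions, 2,000,000 records in partitions of 2,000 give 1,000 partitions and 1 second per partition, but 100 seconds if the account-level figure applies to the read as a whole. So the 1-second result seems to hold only if the per-partition target were the sole limit.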
Any thoughts or advice appreciated.