4
votes

I'd like to evaluate how my Windows Azure Table storage queries scale. For this purpose, I've put together a simple test environment in which I can increase the amount of data in my table and measure the execution times of the queries. Based on these timings, I'd like to define a cost function that can be used to estimate the performance of future queries.
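A minimal sketch of the kind of measurement I mean, written against the .NET storage client's TableQuery API (the entity type, table name, connection string and key values below are placeholders):

    // Sketch only: MeasurementEntity, "testtable", the connection string and the key values are placeholders.
    using System;
    using System.Diagnostics;
    using System.Linq;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    public class MeasurementEntity : TableEntity
    {
        public string Attribute1 { get; set; }
        public string Attribute2 { get; set; }
    }

    class TimingHarness
    {
        static void Main()
        {
            var account = CloudStorageAccount.Parse("<connection string>");
            var table = account.CreateCloudTableClient().GetTableReference("testtable");

            // Query 1: point lookup by PartitionKey and RowKey.
            var query = new TableQuery<MeasurementEntity>().Where(
                TableQuery.CombineFilters(
                    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "pk1"),
                    TableOperators.And,
                    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "rk1")));

            // ToList() forces the lazy enumeration so the timer covers the actual retrieval.
            var sw = Stopwatch.StartNew();
            var results = table.ExecuteQuery(query).ToList();
            sw.Stop();

            Console.WriteLine("{0} entities in {1} ms", results.Count, sw.ElapsedMilliseconds);
        }
    }

The same measurement is repeated after each batch of inserts, so I can see how the execution time grows with the table size.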

I've evaluated the following queries:

  1. Query with PartitionKey and RowKey
  2. Query with PartitionKey and an attribute
  3. Query with PartitionKey and two RowKeys
  4. Query with PartitionKey and two attributes

For the last two queries I've checked the following two patterns (a code sketch of both follows the list):

  1. PartitionKey == "..." && (RowKey == "..." || RowKey == "...")
  2. (PartitionKey == "..." && RowKey == "...") || (PartitionKey == "..." && RowKey == "...")
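In code, the two patterns look roughly like this with the storage client's filter helpers (the key values are placeholders, and MeasurementEntity is the placeholder entity type from the sketch above):

    using Microsoft.WindowsAzure.Storage.Table;

    static class FilterPatterns
    {
        // Placeholders: "pk1", "rk1", "rk2"; MeasurementEntity is the placeholder entity type from above.
        public static void Build()
        {
            string pk  = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "pk1");
            string rk1 = TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "rk1");
            string rk2 = TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "rk2");

            // Pattern 1: PartitionKey == "pk1" && (RowKey == "rk1" || RowKey == "rk2")
            string pattern1 = TableQuery.CombineFilters(
                pk,
                TableOperators.And,
                TableQuery.CombineFilters(rk1, TableOperators.Or, rk2));

            // Pattern 2: (PartitionKey == "pk1" && RowKey == "rk1") || (PartitionKey == "pk1" && RowKey == "rk2")
            string pattern2 = TableQuery.CombineFilters(
                TableQuery.CombineFilters(pk, TableOperators.And, rk1),
                TableOperators.Or,
                TableQuery.CombineFilters(pk, TableOperators.And, rk2));

            var query31 = new TableQuery<MeasurementEntity>().Where(pattern1);
            var query32 = new TableQuery<MeasurementEntity>().Where(pattern2);
        }
    }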

To minimize transfer delay, I executed the tests on an Azure instance. From the measurements, I can see that:

  • query 1 is extremely fast (not surprisingly, as the table is indexed on those fields): about 10-15 ms with roughly 150,000 entries in the table.
  • query 2 requires a partition scan, so its execution time increases linearly with the amount of stored data.
  • query 3.1 performs almost exactly like query 2, so it is apparently also executed as a full partition scan, which seems a bit odd to me.
  • query 4.1 is a bit more than twice as slow as query 3.1, so it seems to be evaluated with two partition scans.
  • and finally, queries 3.2 and 4.2 perform almost exactly four times slower than query 2.

Can you explain the internals of the query/filter interpreter? Even if we accept that query 3.1 needs a partition scan, query 4.1 could also be evaluated with the same logic (and in roughly the same time). Queries 3.2 and 4.2 are a complete mystery to me. Any pointers on those?

Obviously the whole point of this is that I'd like to query distinct elements within one query to minimize cost while not losing performance. But it seems like issuing separate queries (with the Task Parallel Library) for each element is the only really fast solution. What is the accepted way of doing this?


2 Answers

2
votes

With queries like 3.2 and 4.2 there will be full partition scans, executed one by one, along with the attribute filters. The query will not run in parallel even when the partitions are on two separate machines, and that is why you see such long execution times. This is because Windows Azure Table storage does not perform query optimization; it is the code's responsibility to issue the queries in a way that lets them run in parallel.

You are right: if you want better performance, you would need to run the queries in parallel using the Task Parallel Library.
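A rough sketch of what I mean, assuming the .NET storage client and placeholder table, partition key and row key names; each entity is fetched with its own point query (PartitionKey + RowKey), and the lookups run concurrently:

    // Sketch only: "testtable", "pk1" and the row keys are placeholders.
    using System;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    class ParallelPointQueries
    {
        static void Main()
        {
            var account = CloudStorageAccount.Parse("<connection string>");
            var table = account.CreateCloudTableClient().GetTableReference("testtable");
            var rowKeys = new[] { "rk1", "rk2" };

            // One point query per entity, issued concurrently via the TPL.
            Task<TableResult>[] tasks = rowKeys
                .Select(rk => Task.Run(() => table.Execute(TableOperation.Retrieve("pk1", rk))))
                .ToArray();

            Task.WaitAll(tasks);

            foreach (var task in tasks)
            {
                // TableResult.Result holds the retrieved entity, or null if it was not found.
                var entity = task.Result.Result as DynamicTableEntity;
                Console.WriteLine(entity != null ? entity.RowKey : "not found");
            }
        }
    }

Since each lookup is a fast point query like query 1 in your measurements, the overall latency should be close to that of the slowest single lookup rather than the sum of all of them.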

1
votes

Since the details of the table storage internal implementation are not public, if you want to evaluate the performance of future queries, I would suggest checking http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx for some best practices.

Best Regards,

Ming Xu.