4
votes

I'd like to evaluate how my Windows Azure Table storage queries scale. For this purpose, I've put together a simple test environment in which I can increase the amount of data in my table and measure the execution times of the queries. Based on these timings, I'd like to define a cost function that can be used to estimate the performance of future queries.
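A minimal sketch of the kind of measurement I mean, written against the .NET storage client's TableQuery API (the entity type, table name, connection string and key values below are placeholders):

    // Sketch only: MeasurementEntity, "testtable", the connection string and the key values are placeholders.
    using System;
    using System.Diagnostics;
    using System.Linq;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    public class MeasurementEntity : TableEntity
    {
        public string Attribute1 { get; set; }
        public string Attribute2 { get; set; }
    }

    class TimingHarness
    {
        static void Main()
        {
            var account = CloudStorageAccount.Parse("<connection string>");
            var table = account.CreateCloudTableClient().GetTableReference("testtable");

            // Query 1: point lookup by PartitionKey and RowKey.
            var query = new TableQuery<MeasurementEntity>().Where(
                TableQuery.CombineFilters(
                    TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "pk1"),
                    TableOperators.And,
                    TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "rk1")));

            // ToList() forces the lazy enumeration so the timer covers the actual retrieval.
            var sw = Stopwatch.StartNew();
            var results = table.ExecuteQuery(query).ToList();
            sw.Stop();

            Console.WriteLine("{0} entities in {1} ms", results.Count, sw.ElapsedMilliseconds);
        }
    }

The same measurement is repeated after each batch of inserts, so I can see how the execution time grows with the table size.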

I've evaluated the following queries:

  1. Query with PartitionKey and RowKey
  2. Query with PartitionKey and an attribute
  3. Query with PartitionKey and two RowKeys
  4. Query with PartitionKey and two attributes

For the last two queries I've checked the following two patterns (a code sketch of both follows the list):

  1. PartitionKey == "..." && (RowKey == "..." || RowKey == "...")
  2. (PartitionKey == "..." && RowKey == "...") || (PartitionKey == "..." && RowKey == "...")
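In code, the two patterns look roughly like this with the storage client's filter helpers (the key values are placeholders, and MeasurementEntity is the placeholder entity type from the sketch above):

    using Microsoft.WindowsAzure.Storage.Table;

    static class FilterPatterns
    {
        // Placeholders: "pk1", "rk1", "rk2"; MeasurementEntity is the placeholder entity type from above.
        public static void Build()
        {
            string pk  = TableQuery.GenerateFilterCondition("PartitionKey", QueryComparisons.Equal, "pk1");
            string rk1 = TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "rk1");
            string rk2 = TableQuery.GenerateFilterCondition("RowKey", QueryComparisons.Equal, "rk2");

            // Pattern 1: PartitionKey == "pk1" && (RowKey == "rk1" || RowKey == "rk2")
            string pattern1 = TableQuery.CombineFilters(
                pk,
                TableOperators.And,
                TableQuery.CombineFilters(rk1, TableOperators.Or, rk2));

            // Pattern 2: (PartitionKey == "pk1" && RowKey == "rk1") || (PartitionKey == "pk1" && RowKey == "rk2")
            string pattern2 = TableQuery.CombineFilters(
                TableQuery.CombineFilters(pk, TableOperators.And, rk1),
                TableOperators.Or,
                TableQuery.CombineFilters(pk, TableOperators.And, rk2));

            var query31 = new TableQuery<MeasurementEntity>().Where(pattern1);
            var query32 = new TableQuery<MeasurementEntity>().Where(pattern2);
        }
    }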

To minimize transfer delay, I executed the tests on an Azure instance. From the measurements, I can see that:

  • query 1 is extremely fast (not surprisingly, as the table is indexed on those fields): about 10-15 ms with roughly 150,000 entries in the table.
  • query 2 requires a partition scan, so its execution time increases linearly with the amount of stored data.
  • query 3.1 performs almost exactly like query 2, so it is apparently also executed as a full partition scan, which seems a bit odd to me.
  • query 4.1 is a bit more than twice as slow as query 3.1, so it seems to be evaluated with two partition scans.
  • and finally, queries 3.2 and 4.2 perform almost exactly four times slower than query 2.

Can you explain the internals of the query/filter interpreter? Even if we accept that query 3.1 needs a partition scan, query 4.1 could also be evaluated with the same logic (and in roughly the same time). Queries 3.2 and 4.2 are a complete mystery to me. Any pointers on those?

Obviously the whole point of this is that I'd like to query distinct elements within one query to minimize cost while not losing performance. But it seems like issuing separate queries (with the Task Parallel Library) for each element is the only really fast solution. What is the accepted way of doing this?


2 Answers

2
votes

With queries like 3.2 and 4.2 there will be full partition scans, executed one by one, along with the attribute filters. The query will not run in parallel even when the partitions are on two separate machines, and that is why you see such long execution times. This is because Windows Azure Table storage does not perform query optimization; it is the code's responsibility to issue the queries in a way that lets them run in parallel.

You are right: if you want better performance, you would need to run the queries in parallel using the Task Parallel Library.
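A rough sketch of what I mean, assuming the .NET storage client and placeholder table, partition key and row key names; each entity is fetched with its own point query (PartitionKey + RowKey), and the lookups run concurrently:

    // Sketch only: "testtable", "pk1" and the row keys are placeholders.
    using System;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    class ParallelPointQueries
    {
        static void Main()
        {
            var account = CloudStorageAccount.Parse("<connection string>");
            var table = account.CreateCloudTableClient().GetTableReference("testtable");
            var rowKeys = new[] { "rk1", "rk2" };

            // One point query per entity, issued concurrently via the TPL.
            Task<TableResult>[] tasks = rowKeys
                .Select(rk => Task.Run(() => table.Execute(TableOperation.Retrieve("pk1", rk))))
                .ToArray();

            Task.WaitAll(tasks);

            foreach (var task in tasks)
            {
                // TableResult.Result holds the retrieved entity, or null if it was not found.
                var entity = task.Result.Result as DynamicTableEntity;
                Console.WriteLine(entity != null ? entity.RowKey : "not found");
            }
        }
    }

Since each lookup is a fast point query like query 1 in your measurements, the overall latency should be close to that of the slowest single lookup rather than the sum of all of them.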

1
votes

Since the details of the table storage internal implementation are not public, if you want to evaluate the performance of future queries, I would suggest checking http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx for some best practices.

Best Regards,

Ming Xu.