3
votes

I have been reading about how to partition my Azure Table Storage to ensure high performance. I would like to know whether my proposed strategy allows efficient, scalable inserts and simple queries against the data store.

I have 1000 different processes, each uploading a small packet of data (~50 bytes) to Azure Table Storage every 30 seconds. My queries will virtually always filter by process and time. For example, I want to query for all of process A's logs from 7 PM to 9 PM on a given date.

My proposed strategy is to create a table for each process (1000 tables) and then partition the rows so that each partition contains 6 hours of data (4 new partitions per day, 720 rows per partition). Partition key 'NOV82012-0' would contain the 720 rows from midnight on November 8 until 6 AM, 'NOV82012-1' would contain 6 AM to noon, and so on.

This should ensure that I always have fewer than 1000 rows in any partition, so I don't have to worry about continuation tokens. I can also easily 'filter' by process, since each process's data lives in its own table.
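
For concreteness, here is a rough sketch of how I would compute such a partition key (untested, and the helper name is arbitrary):

```csharp
using System;
using System.Globalization;

class ProposedScheme
{
    // Sketch of the partition key described above:
    // upper-case month name + day + year, then the 6-hour bucket index (0-3).
    public static string PartitionKeyFor(DateTime utc)
    {
        string month = utc.ToString("MMM", CultureInfo.InvariantCulture).ToUpperInvariant(); // "NOV"
        int bucket = utc.Hour / 6; // 0 = midnight-6AM, 1 = 6AM-noon, 2 = noon-6PM, 3 = 6PM-midnight
        return month + utc.Day + utc.Year + "-" + bucket;
    }
}
```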

Is this the ideal strategy for this case? Am I missing anything?


2 Answers

2
votes

Actually, you don't need to worry about continuation tokens if you're using the .NET SDK. By calling AsTableServiceQuery() on a query, you'll get an object that automatically handles continuation tokens.

Based on what you're saying you want to filter on a few criteria:

  • Process
  • Date
  • Time

I don't really see a need to create 1 table per process. You could partition it with a combined key: Process+Date. An example:

  • A_20121108
  • A_20121109
  • B_20121108

By combining the process name with the date you can stick to a single table, which keeps things simpler. As for the rows: it's OK to have more than 1000 items per partition. The advantage of having all rows for a given day in the same partition is that you can easily select a range within that partition based on the row key (this is semi-pseudo code, I didn't test it; you might want to improve the row keys):

var query = from item in context.CreateQuery<XXX>("XXX")
            where item.PartitionKey == "A_20121108"
               && item.RowKey.CompareTo("20121108120000") >= 0
               && item.RowKey.CompareTo("20121108193000") <= 0
            select item;
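
Rather than building those literal strings by hand, you could derive the partition key and the row-key bounds from DateTime values. A rough, untested sketch (the helper name is mine, assuming row keys formatted as yyyyMMddHHmmss):

```csharp
using System;
using System.Globalization;

class RowKeyRange
{
    // Builds the partition key and inclusive row-key bounds for one
    // process over a time range within a single day.
    public static (string PKey, string From, string To) Bounds(string process, DateTime from, DateTime to)
    {
        string pkey = process + "_" + from.ToString("yyyyMMdd", CultureInfo.InvariantCulture);
        return (pkey,
                from.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture),
                to.ToString("yyyyMMddHHmmss", CultureInfo.InvariantCulture));
    }
}
```

The question's 7 PM to 9 PM example for process A then becomes `RowKeyRange.Bounds("A", new DateTime(2012, 11, 8, 19, 0, 0), new DateTime(2012, 11, 8, 21, 0, 0))`.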
1
votes

I agree with Sandrino’s suggestion to go with a single table for all the processes.

One thing ATS does not do too well is deletes: entities must be deleted row by row (batches are capped at 100 entities and must share a partition key), whereas dropping an entire table is a single operation. With this in mind, I suggest partitioning by time range at the table level. That way you can simply delete the table once you no longer need the data for that time range.

A keying structure could then be:

Table Name = Prefix + YYYYMM (year & month)
Example: Process201211

PKey = Process + DDHHMM (day of month, hour & minutes)
Example: A081834, B122359, etc.

RKey = seconds or sub-seconds.
If you cannot guarantee uniqueness with sub-seconds, consider appending a GUID.
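
A rough, untested sketch of building these keys (the helper and parameter names are mine):

```csharp
using System;
using System.Globalization;

class TimePartitionedKeys
{
    // Implements the keying structure above: one table per month,
    // one partition per process+minute, GUID-suffixed row key for uniqueness.
    public static (string Table, string PKey, string RKey) MakeKeys(string prefix, string process, DateTime utc)
    {
        string table = prefix + utc.ToString("yyyyMM", CultureInfo.InvariantCulture);   // e.g. Process201211
        string pkey  = process + utc.ToString("ddHHmm", CultureInfo.InvariantCulture);  // e.g. A081834
        // Sub-second precision plus a GUID so two packets in the same
        // millisecond still get distinct row keys.
        string rkey  = utc.ToString("ssfff", CultureInfo.InvariantCulture) + "_" + Guid.NewGuid().ToString("N");
        return (table, pkey, rkey);
    }
}
```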