
I want to experiment with using Azure table storage as a replacement for Azure SQL. Our application(s) are typically write-heavy (think event registration) and SQL has long been a performance bottleneck.

However, a typical entry may have duplicate restrictions on more than one set of columns. For example, a new entry cannot have a duplicate email address, but it may also not have a duplicate mailing address. I cannot use a combined key for this, since each must be checked for duplicates separately. In SQL, I typically create a unique index on email address, and then another unique index on the physical address columns. This serves my purpose from a data integrity perspective, but again, performance is the issue.

Were I to use Azure storage, I'd probably use the event ID + an hourly timestamp (UTC) as the partition key, and each user's unique ID as the row key, e.g.:

Partition Key                  Row Key                 <columns...>
event101_201402130500          <user's unique ID>      <user data...>

I thought to include the hourly timestamp in the partition key just to make aggregate reporting easier later. Each hour I could use a batch job to request all entities for the previous hour's partition key and perform whatever aggregates are necessary, storing the result in another table for further aggregation and reporting.
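
For illustration, here is a rough sketch of how such an entity might be built in Python; the event ID and the hourly timestamp format mirror the example above, while the uuid row key, the function name, and the dictionary layout are just assumptions about how the user row could look:

    # Rough sketch (not from the original post): build a registration entity
    # using the proposed key scheme.
    from datetime import datetime, timezone
    import uuid

    def make_registration_entity(event_id, user_data):
        # Partition key = event ID + hourly UTC bucket, e.g. "event101_201402130500"
        hour_bucket = datetime.now(timezone.utc).strftime("%Y%m%d%H00")
        return {
            "PartitionKey": f"{event_id}_{hour_bucket}",
            "RowKey": str(uuid.uuid4()),   # the user's unique ID
            **user_data,                   # remaining user columns
        }

A registration for event101 at 05:17 UTC on 2014-02-13 would then land in partition event101_201402130500, the same bucket the hourly aggregation job would read back.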

But how do I prevent duplicates on the other columns? I thought about using a hash of the user's email + mailing address, but that doesn't enforce uniqueness on each part individually: if either value changes, the combined hash is new and unique, even though the other value may still be a duplicate.

I thought about using a separate table to track duplicates per column, something like:

Partition Key           Row Key                  User Key
event101                email_<user's email>     <user's unique ID>
event101                addr_<hash of NACSZ>     <user's unique ID>

I would insert these duplicate-check entities within a single batch transaction; since both rows share the same partition key (event101), they qualify as one entity group transaction. If the transaction failed (because of a duplicate PK/RK), I would simply avoid inserting the user row. This should work as long as the batch is fewer than 100 entities and under 4 MB (according to the ATS docs).
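
A minimal sketch of that duplicate-check batch, assuming the azure-data-tables Python SDK; the table client, the lower-casing of the values, the SHA-256 hash of the NACSZ fields, and the function name are my assumptions, not something prescribed by the original design:

    from azure.data.tables import TableClient, TableTransactionError
    import hashlib

    def try_reserve_uniqueness(dedupe_table: TableClient, event_id: str,
                               user_id: str, email: str, nacsz: str) -> bool:
        # One row per unique constraint, both in the same partition (the event),
        # so they can go into a single entity group transaction.
        # Note: '/', '\', '#' and '?' are not allowed in row keys, so an email
        # containing such characters would need to be hashed or encoded as well.
        addr_hash = hashlib.sha256(nacsz.lower().encode()).hexdigest()
        operations = [
            ("create", {"PartitionKey": event_id,
                        "RowKey": "email_" + email.lower(),
                        "UserKey": user_id}),
            ("create", {"PartitionKey": event_id,
                        "RowKey": "addr_" + addr_hash,
                        "UserKey": user_id}),
        ]
        try:
            dedupe_table.submit_transaction(operations)
            return True   # both values were unused; safe to insert the user row
        except TableTransactionError:
            return False  # at least one duplicate; skip the user row

Because the batch fails as a whole if any single insert conflicts, this gives the all-or-nothing duplicate check described above.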

Overall, this seems like it would do the trick, but given my lack of experience with table storage in this manner, I want to be sure I'm not overlooking something or over-thinking the solution.

So my actual question here is: Is ATS really this simple and is my solution adequate, or should I consider other (as yet unknown to me) options?

I believe you really aren't overlooking anything. I haven't had the chance to work with batch transactions, but they sound right, and I know of no other way to achieve this. – Ilya Kogan
I don't know what some of your overall design considerations or constraints might be that require uniqueness on those fields, but I have an alternate suggestion. Just let the records write to storage as-is (allowing duplicates), and deal with the issue later. Later could either be through filtering when you report, or a separate job that scans the tables for duplicates and removes them. – Nathan
I wish I could do that, Nathan, but the client almost always wants to notify the customer in real time if their entry was rejected because of duplicate information. – Chris

1 Answer


I would suggest using batch transactions as you suggested, but would do something slightly different. I'm not sure if you need the email and/or address to be unique per event, but I have assumed so.

I would have a table that stores two entries per event registration: one for the email and one for the address. You already thought about this too.

Then have a separate table to hold the actual event registration. Only after the email and address rows have been successfully written to the first table would you then add the registration to the event table.

I believe you can put this into one transaction.
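
As a sketch of that flow (again assuming the azure-data-tables Python SDK, with placeholder table and function names): the two duplicate-check rows share a table and a partition key, so they fit in one entity group transaction, and the registration row is only written once that batch has succeeded.

    from azure.data.tables import TableServiceClient, TableTransactionError

    def register(service: TableServiceClient, event_id: str,
                 registration: dict, email: str, addr_hash: str) -> bool:
        dedupe = service.get_table_client("EventDedupe")          # placeholder name
        events = service.get_table_client("EventRegistrations")   # placeholder name

        # Step 1: reserve the email and address in one entity group transaction.
        try:
            dedupe.submit_transaction([
                ("create", {"PartitionKey": event_id,
                            "RowKey": "email_" + email.lower(),
                            "UserKey": registration["RowKey"]}),
                ("create", {"PartitionKey": event_id,
                            "RowKey": "addr_" + addr_hash,
                            "UserKey": registration["RowKey"]}),
            ])
        except TableTransactionError:
            return False  # duplicate email or address: reject in real time

        # Step 2: only after step 1 succeeds, write the actual registration row.
        # Entity group transactions are scoped to a single table and partition,
        # so this second write is a separate call.
        events.create_entity(registration)
        return True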