3
votes

I need to insert some data in bulk into an Azure Storage table. As required, all entities share the same partition key.

The thing is, though, that some entities (PK+RK combinations) may already be present in the destination table.

What I understand is that the entire transaction either succeeds or fails as a whole, so my question is: what happens if, while inserting these entities as a transaction, some of them are duplicates?

Will the whole thing fail?

Is there any way to prevent this from failing without checking entity by entity?

Thanks. Happy new year!


3 Answers

4
votes

Have you seen the new Upsert behavior? This might be a good case for it. If you just want to overwrite existing entities, you can use the InsertOrReplace operation (or InsertOrMerge if you do care about preserving existing property values). This will ignore errors on collision and use either a Replace or a Merge operation instead.
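A rough sketch of what that batch upsert could look like with the .NET storage client's TableBatchOperation API; the connection string, the "mytable" table name, and the MyEntity type are placeholders, not anything from the question:

    using System.Collections.Generic;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Placeholder entity type -- substitute your own TableEntity subclass.
    public class MyEntity : TableEntity
    {
        public MyEntity() { }
        public MyEntity(string pk, string rk) : base(pk, rk) { }
        public string Payload { get; set; }
    }

    public static class BulkUpsert
    {
        public static void Run(string connectionString, IEnumerable<MyEntity> entities)
        {
            var account = CloudStorageAccount.Parse(connectionString);
            var table = account.CreateCloudTableClient().GetTableReference("mytable");
            table.CreateIfNotExists();

            var batch = new TableBatchOperation();
            foreach (var entity in entities)    // all entities share one PartitionKey
                batch.InsertOrReplace(entity);  // replaces a row already in the table instead of failing

            table.ExecuteBatch(batch);          // the service allows at most 100 operations per batch
        }
    }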

2
votes

Using SDK 1.8, I tried InsertOrReplace as well as InsertOrMerge. I also tried setting the ETag to "*". Each approach returned the following error when I executed a batch operation that included a duplicate entity (same PartitionKey/RowKey):

Unexpected response code for operation : 0

HResult: -2146233088

After digging deep, the core error was:

InvalidInput

1:One of the request inputs is not valid.

According to this, an entity can appear only once in a transaction, and only one operation may be performed against it.

The resolution for us was to remember the RowKeys already added to the batch transaction and handle duplicates appropriately, so that we added only one operation per entity to the batch. In our case, it was safe to omit the duplicates from the batch transaction.
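A minimal sketch of that de-duplication step (the helper name and the use of InsertOrReplace here are illustrative assumptions, not the exact code we ran); a HashSet of RowKeys keeps each entity down to a single operation per batch:

    using System.Collections.Generic;
    using Microsoft.WindowsAzure.Storage.Table;

    public static class BatchDeduplication
    {
        // Builds a batch containing at most one operation per RowKey
        // (every entity in a single batch shares the same PartitionKey anyway).
        public static TableBatchOperation BuildBatch(IEnumerable<ITableEntity> entities)
        {
            var seenRowKeys = new HashSet<string>();
            var batch = new TableBatchOperation();

            foreach (var entity in entities)
            {
                // Skip in-batch duplicates; in our case it was safe to drop them.
                if (!seenRowKeys.Add(entity.RowKey))
                    continue;

                batch.InsertOrReplace(entity);  // still upserts against rows already in the table
            }

            return batch;
        }
    }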

1
vote

Unfortunately, your batch will succeed or fail in an atomic fashion. There is no way to ignore errors for just those operations that fail.

What you probably want to do is implement some intelligent error handling here. Your issue is that checking a priori for duplicates would be very expensive, because there is no batch GET operation (OK, strictly speaking there is support, but only for one query per batch). My initial thought is that the most efficient way to deal with this would be to take a failed batch and basically binary-tree search it.

Proposed Approach

Take your failed batch and split it in half, so a batch of 100 operations becomes two batches of 50. Execute both. Keep splitting each batch that fails and eliminating batches that have succeeded. You could probably write this as a reasonably efficient and parallelizable algorithm by modelling your entire dataset as a single 'batch', applying a maximum batch size of 100, and then splitting from there. Each batch can be executed independently of the others; because you'll just ignore duplicates, it doesn't matter which copy of a duplicated row is inserted first. A sketch follows below.
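A sketch of that split-on-failure idea, assuming plain Insert operations and a helper that simply drops a batch once it is down to a single failing (presumably duplicate) row; the class and method names here are hypothetical:

    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    public static class BatchSplitter
    {
        private const int MaxBatchSize = 100;   // Table service limit per batch

        // Inserts as many rows as possible, splitting any failing batch in half
        // and retrying each half until the failing (duplicate) rows are isolated.
        public static void InsertIgnoringDuplicates(CloudTable table, IList<ITableEntity> entities)
        {
            // Model the whole dataset as batches of at most 100 operations each.
            for (int i = 0; i < entities.Count; i += MaxBatchSize)
            {
                var chunk = entities.Skip(i).Take(MaxBatchSize).ToList();
                InsertChunk(table, chunk);
            }
        }

        private static void InsertChunk(CloudTable table, IList<ITableEntity> chunk)
        {
            if (chunk.Count == 0)
                return;

            var batch = new TableBatchOperation();
            foreach (var entity in chunk)
                batch.Insert(entity);

            try
            {
                table.ExecuteBatch(batch);       // atomic: succeeds or fails as a whole
            }
            catch (StorageException)
            {
                if (chunk.Count == 1)
                    return;                      // single failing row: assume duplicate, ignore it

                // Binary-tree search: split the failed batch and retry each half.
                int mid = chunk.Count / 2;
                InsertChunk(table, chunk.Take(mid).ToList());
                InsertChunk(table, chunk.Skip(mid).ToList());
            }
        }
    }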

Others may like to chime in, but I would think this gives you the most efficient way to insert your data while ignoring duplicates.

Another option would be to de-dupe the data before it hits Azure Table Storage, but I would want to know the total number of rows and the relative duplicate frequency before commenting on whether that is a better approach.