5
votes

So here is the problem. I need to update about 40 million entities in an azure table. Doing this with a single instance (select -> delete original -> insert with new partitionkey) will take until about Christmas.

My thought is use an azure worker role with many instances running. The problem here is the query grabs the top 1000 records. That's fine with one instance but with 20 running their selects will obviously overlap.. a lot. this would result in a lot of wasted compute trying to delete records that were already deleted by another instance and updating a record that has already been updated.

I've run through a few ideas, but the best option I have is to have the roles fill up a queue with partition and row keys then have the workers dequeue and do the actual processing?

Any better ideas?

3

3 Answers

5
votes

Very interesting question!!! Extending @Brian Reischl's answer (and a lot of it is thinking out loud, so please bear with me :))

Assumptions:

  • Your entities are serializable in some shape or form. I would assume that you'll get raw data in XML format.
  • You have one separate worker role which is doing all the reading of entities.
  • You know how many worker roles would be needed to write modified entities. For the sake of argument, let's assume it is 20 as you mentioned.

Possible Solution:

First you will create 20 blob containers. Let's name them container-00, container-01, ... container-19.

Then you start reading entities - 1000 at a time. Since you're getting raw data in XML format out of table storage, you create an XML file and store those 1000 entities in container-00. You fetch next set of entities and save them in XML format in container-01 and so on and so forth till the time you hit container-19. Then the next set of entities go into container-00. This way you're evenly distributing your entities across all the 20 containers.

Once all the entities are written, your worker role for processing these entities would come into picture. Since we know that instances in Windows Azure are sequentially ordered, you get instance names like WorkerRole_IN_0, WorkerRole_IN_1, ... and so on.

What you would do is take the instance name, get the number "0", "1" etc. Based on this you would determine which worker role instance will read from which blob container...WorkerRole_IN_0 will read files from container-00, WorkerRole_IN_1 will read files from container-01 and so on.

Now your individual worker role instance will read the XML file, create the entities from that XML file, update those entities and save it back into table storage. Once this process is done, you would then delete the XML file and you move on to next file in that container. Once all files are read and processed, you can just delete the container.

As I said earlier, this is a lot "thinking out loud" kind of solution and some things must be considered like what happens when "reader" worker role goes down and other things.

3
votes

If your PartitionKeys and/or RowKeys fall into a known range, you could attempt to divide them into disjoint sets of roughly equal size for each worker to handle. eg, Worker1 handles keys starting with 'A' through 'C', Worker2 handles keys starting with 'D' through 'F', etc.

If that's not feasible, then your queuing solution would probably work. But again, I would suggest that each queue message represent a range of keys if possible. eg, a single queue message specifies deleting everything in the range 'A' through 'C', or something like that.

In any case, if you have multiple entities in the same PartitionKey then use batch transactions to your advantage for both inserting and deleting. That could cut down the number of transactions by almost a factor of ten in the best case. You should also use parallelism within each worker role. Ideally use the async methods (either Begin/End or *Async) to do the writing, and run several transactions (12 is probably a good number) in parallel. You can also run multiple threads, but that's somewhat less efficient. In either case, a single worker can push a lot of transactions with table storage.

As a side note, your process should go "Select -> Insert New -> Delete Old". Going "Select -> Delete Old -> Insert New" could result in permanent data loss if a failure occurs between steps 2 & 3.

2
votes

I think you should mark your question as the answer ;) I cant think of a better solution since I don't know what your partition and row keys look like. But to enhance your solution, you may choose to pump multiple partition/row keys into each queue message to save on transaction cost. Also when consuming from the queue, get them in batches of 32. Process asynchronously. I was able to transfer 170 million records from SQL server (Azure) to Table storage in less than a day.