In our service, we use SQL Azure as the main storage and Azure Table Storage as the backup storage. Every day, about 30 GB of data is collected and stored in SQL Azure. Since the data is no longer valid from the next day, we want to migrate it from SQL Azure to Azure Table Storage every night.

The question is: what would be the most efficient way to migrate this data from SQL Azure to Azure Table Storage?

The naive idea I came up with is to apply the producer/consumer pattern using an IDataReader. That is, a producer obtains a data reader by executing "SELECT * FROM TABLE" and pushes the rows into an in-memory queue. At the same time, a set of consumer threads pull rows from the queue and insert them into Azure Table Storage.
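
To make the idea concrete, here is a rough sketch of what I have in mind (assuming the WindowsAzure.Storage SDK; the DailyData query, the dailybackup table name, and the LogEntity class are just placeholders for our real schema):

    using System;
    using System.Collections.Concurrent;
    using System.Data.SqlClient;
    using System.Linq;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Table;

    // Hypothetical entity shape; the properties would mirror our SQL columns.
    class LogEntity : TableEntity
    {
        public string Payload { get; set; }
    }

    class SqlToTableCopier
    {
        public static void Copy(string sqlConnStr, string storageConnStr)
        {
            // Bounded buffer so the producer cannot run too far ahead of the consumers.
            var buffer = new BlockingCollection<LogEntity>(boundedCapacity: 10000);

            // Producer: stream rows out of SQL Azure through one long-lived reader.
            var producer = Task.Run(() =>
            {
                using (var conn = new SqlConnection(sqlConnStr))
                using (var cmd = new SqlCommand("SELECT Id, Payload, CreatedAt FROM DailyData", conn))
                {
                    conn.Open();
                    using (var reader = cmd.ExecuteReader())
                    {
                        while (reader.Read())
                        {
                            buffer.Add(new LogEntity
                            {
                                PartitionKey = reader.GetDateTime(2).ToString("yyyyMMdd"),
                                RowKey = reader.GetInt64(0).ToString(),
                                Payload = reader.GetString(1)
                            });
                        }
                    }
                }
                buffer.CompleteAdding();
            });

            // Consumers: drain the buffer and insert entities into the target table.
            // (Batching up to 100 entities per partition would cut the number of
            // round trips, but single inserts keep the sketch simple.)
            var table = CloudStorageAccount.Parse(storageConnStr)
                .CreateCloudTableClient()
                .GetTableReference("dailybackup");
            table.CreateIfNotExists();

            var consumers = Enumerable.Range(0, 4).Select(_ => Task.Run(() =>
            {
                foreach (var entity in buffer.GetConsumingEnumerable())
                {
                    table.Execute(TableOperation.Insert(entity));
                }
            })).ToArray();

            Task.WaitAll(consumers.Concat(new[] { producer }).ToArray());
        }
    }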

Of course, the main disadvantage of this approach (I think) is that we need to keep the connection open for a long time (possibly several hours).

Another approach is to first copy the data from the SQL Azure table to local storage on Windows Azure and then apply the same producer/consumer pattern. With this approach, we can close the connection as soon as the copy is done.
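
Again, just a sketch of the staging step, assuming a hypothetical local resource named LocalBackup declared in the role configuration and a naive CSV dump; the point is only that the SQL connection can be closed as soon as the file is written:

    using System.Data.SqlClient;
    using System.IO;
    using Microsoft.WindowsAzure.ServiceRuntime;

    class LocalStagingStep
    {
        public static string DumpToLocalCsv(string sqlConnStr)
        {
            // "LocalBackup" is a hypothetical local-resource name from ServiceDefinition.csdef.
            var localRoot = RoleEnvironment.GetLocalResource("LocalBackup").RootPath;
            var path = Path.Combine(localRoot, "dailydata.csv");

            using (var conn = new SqlConnection(sqlConnStr))
            using (var cmd = new SqlCommand("SELECT Id, Payload, CreatedAt FROM DailyData", conn))
            using (var writer = new StreamWriter(path))
            {
                conn.Open();
                using (var reader = cmd.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        // Naive CSV; real code would have to escape delimiters in Payload.
                        writer.WriteLine("{0},{1},{2:o}",
                            reader.GetInt64(0), reader.GetString(1), reader.GetDateTime(2));
                    }
                }
            } // The SQL connection is closed here; the producer/consumer part then reads the file instead.

            return path;
        }
    }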

At this point, I'm not sure which one is better, or whether either of them is a good design at all. Could you suggest a good design for this problem?

Thanks!

Is this 30 GB of data from one table or many tables? – Gaurav Mantri

1 Answer

I would not recommend using local storage, primarily because:

  1. It is transient storage.
  2. You're limited by the size of local storage (which in turn depends on the size of the VM).
  3. Local storage is local only, i.e. it is accessible only to the VM in which it is created, which prevents you from scaling out your solution.

I like the idea of using queues; however, I see some issues there as well:

  1. Assuming you're planning on storing each row in the queue as a message, you would be performing a lot of storage transactions. If we assume that your row size is 64 KB, storing 30 GB of data would take about 30 GB / 64 KB ≈ 492,000, i.e. roughly 500,000 write transactions (and a similar number of read transactions) - I hope I got my math right :). Even though storage transactions are cheap, that is still a lot of transactions, which would slow down the entire process.
  2. Since you would be doing so many transactions, you may get throttled once you hit the storage scalability limits. You may want to look into that.
  3. Yet another limitation is the maximum message size: currently a single queue message can hold at most 64 KB of data. What would happen if a row is larger than that?

I would actually recommend throwing blob storage into the mix. What you could do is read a chunk of data from the SQL table (say 10,000 or 100,000 records) and save that data in blob storage as a file. Depending on how you want to put the data into table storage, you could store it in CSV, JSON, or XML format (XML if you need to preserve data types). Once the file is written to blob storage, you could write a message to the queue containing the URI of the blob you've just written. Your worker role (the processor) would continuously poll this queue, get a message, fetch the file from blob storage, and process it. Once the worker role has processed the file, it can simply delete the file and the message.
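
Just to illustrate the shape of this (a sketch, not a drop-in implementation), assuming the WindowsAzure.Storage SDK, with placeholder container/queue names and the chunks already serialized to CSV by the caller:

    using System;
    using System.Threading;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Auth;
    using Microsoft.WindowsAzure.Storage.Blob;
    using Microsoft.WindowsAzure.Storage.Queue;

    class BlobQueuePipeline
    {
        readonly CloudBlobContainer container;
        readonly CloudQueue queue;
        readonly StorageCredentials credentials;

        public BlobQueuePipeline(string storageConnStr)
        {
            var account = CloudStorageAccount.Parse(storageConnStr);
            credentials = account.Credentials;
            container = account.CreateCloudBlobClient().GetContainerReference("daily-export");
            queue = account.CreateCloudQueueClient().GetQueueReference("daily-export");
            container.CreateIfNotExists();
            queue.CreateIfNotExists();
        }

        // Producer side: persist one chunk as a blob, then enqueue its URI.
        public void PublishChunk(int chunkIndex, string csvChunk)
        {
            var blob = container.GetBlockBlobReference(string.Format("chunk-{0:D6}.csv", chunkIndex));
            blob.UploadText(csvChunk);
            queue.AddMessage(new CloudQueueMessage(blob.Uri.ToString()));
        }

        // Worker role side: poll the queue, process the referenced blob,
        // and delete the blob and the message only after processing succeeds.
        public void ProcessNext(Action<string> writeChunkToTableStorage)
        {
            var message = queue.GetMessage(TimeSpan.FromMinutes(10)); // visibility timeout
            if (message == null)
            {
                Thread.Sleep(TimeSpan.FromSeconds(5)); // queue is empty, back off and poll again
                return;
            }

            var blob = new CloudBlockBlob(new Uri(message.AsString), credentials);
            writeChunkToTableStorage(blob.DownloadText());

            blob.DeleteIfExists();
            queue.DeleteMessage(message);
        }
    }

One nice property of this design is that the SQL read phase and the table write phase are decoupled by durable storage: you can scale out the worker instances, and if one of them crashes mid-file, the message simply becomes visible again after the visibility timeout and another instance picks it up.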