4
votes

I'm trying to bulk load about 25 million rows from an Azure SQL table into three different tables in Azure Table Storage. I'm currently managing to process about 50-100 rows / second, which means that at current speeds, it'll take me about 70-140 hours to finish the load. That's a long time, and it seems like it ought to be possible to speed that up.

Here's what I'm doing:

  • Kick off 10 separate tasks
  • For each task, read the next 10,000 unprocessed records from the SQL DB
  • For each of the three destination ATS tables, group the 10,000 records by that table's partition key
  • In parallel (up to 10 simultaneously), for each partition key, split the partition into segments of at most 100 rows (the maximum size of an ATS batch)
  • In parallel (up to 10 simultaneously), for each segment, create a new TableBatchOperation.
  • For each row in the segment, add a batch.InsertOrReplace() operation (because some of the data has already been loaded, and I don't know which rows)
  • Execute the batch asynchronously (a simplified sketch of these batching steps follows this list)
  • Rinse and repeat (with lots of flow control, error checking, etc.)
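
In case it helps, here's roughly what that grouping/batching step looks like. This is a simplified sketch rather than my actual code: MyEntity, the hard-coded segment size, and the 10-way throttle are stand-ins for the real pieces described above.

// Simplified sketch of the batching step, using the classic
// Microsoft.WindowsAzure.Storage.Table SDK that TableBatchOperation comes from.
// MyEntity and the 10-way throttle are illustrative placeholders.
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage.Table;

public static class BatchLoader
{
    public static async Task WriteChunkAsync(CloudTable table, IEnumerable<MyEntity> rows)
    {
        var batches = new List<TableBatchOperation>();

        // A batch may only touch one partition, so group the chunk by partition key.
        foreach (var partition in rows.GroupBy(r => r.PartitionKey))
        {
            // Split each partition into segments of at most 100 rows,
            // the maximum size of an entity group transaction.
            foreach (var segment in partition.Select((row, i) => new { row, i })
                                             .GroupBy(x => x.i / 100, x => x.row))
            {
                var batch = new TableBatchOperation();
                foreach (var row in segment)
                {
                    // InsertOrReplace because some rows may already have been loaded.
                    batch.InsertOrReplace(row);
                }
                batches.Add(batch);
            }
        }

        // Execute up to 10 batches at a time; each call is I/O-bound.
        foreach (var window in batches.Select((b, i) => new { b, i })
                                      .GroupBy(x => x.i / 10, x => x.b))
        {
            await Task.WhenAll(window.Select(b => table.ExecuteBatchAsync(b)));
        }
    }
}

// Placeholder entity; the real one maps the SQL columns.
public class MyEntity : TableEntity
{
}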

Some notes:

  • I've tried this several different ways, with lots of different parameters for the various numbers up above, and I'm still not getting it down to less than 10-20 ms / event.
  • It doesn't seem to be CPU bound, as the VM doing the load is averaging about 10-20% CPU.
  • It doesn't seem to be SQL-bound, as the SQL select statement is the fastest part of the operation by at least two orders of magnitude.
  • It's presumably not network-bound, as the VM executing the batches is in the same data center as the storage account (US West).
  • I'm getting reasonable partition density, i.e., each 10K set of records is getting broken up into a couple hundred partitions for each table.
  • With perfect partition density, I'd have up to 3000 tasks running simultaneously (10 master tasks * 3 tables * 10 partitions * 10 segments). But they're executing asynchronously, and they're nearly all I/O bound (by ATS), so I don't think we're hitting any threading limits on the VM executing the process.

The only other obvious idea I can come up with is one that I tried earlier on, namely, to do an ORDER BY on the partition key in the SQL select statements, so that we can get perfect partition density for the batch inserts. For various reasons that has proven to be difficult, as the table's indexes aren't quite set up for that. And while I would expect some speedup on the ATS side using that approach, given that I'm already grouping the 10K records by their partition keys, I wouldn't expect much additional performance improvement.

Any other suggestions for speeding this up? Or is this about as fast as anybody else has been able to get?

This link should give you the information you are looking for: leiliweb.wordpress.com/2012/12/11/… – J. Davidson
@J.Davidson - Thanks for the suggestion, but that talks about partitioning strategies for SQL Server, and I'm having trouble with Azure Table Storage. Rather different beast :-). – Ken Smith

2 Answers

1
votes

Still open to other suggestions, but I found this page quite helpful:

http://blogs.msmvps.com/nunogodinho/2013/11/20/windows-azure-storage-performance-best-practices/

Specifically, these settings (applied once, before any storage requests are made):

// Don't send the Expect: 100-continue header, which costs an extra round trip per request
ServicePointManager.Expect100Continue = false;
// Don't let Nagle's algorithm buffer the relatively small table-batch payloads
ServicePointManager.UseNagleAlgorithm = false;
// Raise the per-endpoint connection limit (the .NET Framework default is just 2)
ServicePointManager.DefaultConnectionLimit = 100;
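
In my case that just means applying them once at startup, before the table client sends its first request. Something like this (a rough sketch assuming the classic SDK; StorageClientFactory is just an illustrative name):

using System.Net;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

public static class StorageClientFactory
{
    public static CloudTableClient Create(string connectionString)
    {
        // These are process-wide ServicePoint settings; set them before
        // the first request so every connection picks them up.
        ServicePointManager.Expect100Continue = false;
        ServicePointManager.UseNagleAlgorithm = false;
        ServicePointManager.DefaultConnectionLimit = 100;

        var account = CloudStorageAccount.Parse(connectionString);
        return account.CreateCloudTableClient();
    }
}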

With those, I was able to drop the average processing time from ~10-20 ms / event down to ~2 ms. Much better.

But as I said, still open to other suggestions. I've read about other folks getting upwards of 20,000 operations per second on ATS, and I'm still stuck around 500.

0
votes

What about your partition keys? If they are incremental numbers, Azure will group them onto one storage node. So you should use completely different partition keys, e.g. "A1", "B2", etc. instead of "1", "2", etc. That way your partitions will be handled by different storage nodes, and throughput will be multiplied.
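
For example, something like this (just an illustrative sketch, not a specific Azure API; the bucket count and key format are arbitrary):

// Derive a non-sequential partition key by prefixing the natural key with a
// short, stable bucket, so consecutive IDs land in different partition ranges.
public static class PartitionKeys
{
    public static string ForId(long id, int buckets = 16)
    {
        // Use a stable modulo; GetHashCode() is not stable across processes.
        int bucket = (int)(id % buckets);
        return $"{bucket:X2}-{id}";   // e.g. "07-1234567" instead of "1234567"
    }
}

Rows that go into the same batch still need the same partition key, so apply the same scheme consistently on every write and read.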