
I'd like to find the best method of transferring 20 GB of SQL data from a SQL Server database installed on a customer's onsite server (Client) to our Azure SQL database (Source), an S4 with 200 DTUs for $320 a month. For the initial setup, we configure an Azure Data Factory that copies the 20 GB over as multiple table-to-table copies, e.g., Client Table A's content to Source Table A, Client Table B's content to Source Table B, and so on. We then run a number of Extractor stored procedures that populate Stage tables with data from the Source tables by joining those Source tables together, e.g., Source A joined to Source B. After that, only incremental copies run, but the initial setup takes forever.
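
For reference, each table copy is just a standard copy activity roughly like the sketch below (the activity and dataset names are simplified placeholders, not our real ones):

    {
        "name": "CopyClientTableAToSourceTableA",
        "type": "Copy",
        "inputs": [ { "referenceName": "ClientTableA", "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SourceTableA", "type": "DatasetReference" } ],
        "typeProperties": {
            "source": { "type": "SqlSource" },
            "sink": { "type": "SqlSink", "writeBatchSize": 10000 }
        }
    }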

Currently, the copy on an S4 takes around 12 hours, and the extraction takes another 4 hours. Increasing the performance tier to an S9 with 1600 DTUs for $2,400 a month cuts the copy to about 6 hours and the extraction to 2 hours, but that brings a much higher cost with it.

I was wondering whether there are other Azure approaches. Would setting up an HDInsight cluster with Hadoop or Spark be more cost-efficient than scaling the Azure SQL DB up to an S9 or beyond? An S9 at $2,400 over a 31-day month works out to about $3.23 an hour, while an Azure HDInsight cluster with memory-optimized D14 v2 nodes is $1.496 per hour, so it would be cheaper than an S9. But how does it compare in terms of performance: would the copying process be quicker, or would the extraction process be quicker?

I am not familiar with Big Data methods yet. Thank you for any help.

If you want to investigate a big-data method here, use Azure Databricks. It has Spark and first-class support for on-demand clusters. You could use ADF to load Azure Blob Storage, and try using Spark SQL to do the joins before loading SQL Server. – David Browne - Microsoft
Is there a reason you can't scale up temporarily, i.e. just for the period your job runs? Enabling the staging option in Data Factory, as suggested by @LeonYue, will likely make a difference - please confirm. Can you also confirm you are doing an incremental load, not a full reload every time? Do you have ExpressRoute? The extract does seem extraordinarily long for this relatively small volume. – wBob
@wBob The times were wrong, sorry about that, but the copying does take forever. We have switched to incremental copies now, which helps tremendously. I reworded the question to be about the initial setup for a client and the initial copy, which does take forever. Transforming the data will still take a long time, since new data has to reference the old data too. – Dung Tran

1 Answer


Azure Data Factory Copy Activity delivers a first-class secure, reliable, and high-performance data loading solution. It enables you to copy tens of terabytes of data every day across a rich variety of cloud and on-premises data stores. Copy Activity offers a highly optimized data loading experience that is easy to configure and set up.

You can consult the Copy Activity performance reference table in the documentation: [image: Copy Activity performance reference table]. It shows the copy throughput in MBps for given source and sink pairs in a single copy activity run, based on in-house testing.

If you want the data to be transferred more quickly using Azure Data Factory Copy Activity, Azure provides three ways to achieve higher throughput:

  1. Data Integration Units. A Data Integration Unit (DIU), formerly known as a Cloud Data Movement Unit (DMU), is a measure that represents the power (a combination of CPU, memory, and network resource allocation) of a single unit in Data Factory. You can achieve higher throughput by using more DIUs. You are charged based on the total time of the copy operation; the total duration you are billed for data movement is the sum of the durations across DIUs.

  2. Parallel copy. You can use the parallelCopies property to indicate the parallelism that you want Copy Activity to use. For each Copy Activity run, Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store.

  3. Staged copy. When you copy data from a source data store to a sink data store, you might choose to use Blob storage as an interim staging store.

You can use these options to tune the performance of your Data Factory Copy Activity.
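
As a rough sketch, all three settings live in the copy activity's typeProperties; the DIU and parallel-copy values and the staging linked service and path below are only illustrative placeholders, not recommendations:

    "typeProperties": {
        "source": { "type": "SqlSource" },
        "sink": { "type": "SqlSink" },
        "dataIntegrationUnits": 32,
        "parallelCopies": 8,
        "enableStaging": true,
        "stagingSettings": {
            "linkedServiceName": { "referenceName": "StagingBlobStorage", "type": "LinkedServiceReference" },
            "path": "adf-staging"
        }
    }

With enableStaging turned on, Copy Activity first writes the data to the Blob staging store and then loads it into the sink, which can help when copying from an on-premises source over a slower connection.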

For more details about Azure Data Factory Copy Activity performance, please see: https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance#data-integration-units