1
votes

I'm trying to move some data from Azure SQL Server Database to Azure Blob Storage with the "Copy Data" pipeline in Azure Data Factory. In particular, I'm using the "Use query" option with the ?AdfDynamicRangePartitionCondition hook, as suggested by Microsoft's pattern here, in the Source tab of the pipeline, and the copy operation is parallelized by the presence of a partition key used in the query itself.

The source on SQL Server Database consists of two views with ~300k and ~3M rows, respectively. Additionally, the views have the same query structure, e.g. (pseudo-code)

with 
    v as (
        select hashbyte(field1) [Key1], hashbyte(field2) [Key2]
        from Table
    )
    select * 
    from v

and so do the tables that are queried by the views. On top of this, the views query the same number of partitions with a roughly equally distributed number of rows.

The unexpected behavior - most likely due to the lack of experience from my side - of the copy operation is that it lasts much longer for the view that query fewer rows. In fact, the copy operation with ~300k rows shows a throughput of ~800 KB/s, whereas the one with ~3M rows shows a throughput of ~15MB/s (!). Lastly, the writing operation to the blob storage is pretty fast for both cases, as opposite to the reading-from-source operation.

I don't expect anyone to provide an actual solution - as the information provided is limited -, but I'd rather like some hints on what could be affecting the copy performance so badly for the case where the view queries much (roughly an order of magnitude) fewer rows, taking into account that the tables under the views have a comparable number of fields, and also the same data types: both the tables that the views query contain int, datetime, and varchar data types.

Thanks in advance for any heads up.

2
Do you read this document: docs.microsoft.com/en-us/azure/data-factory/…? it's may be helpful.Leon Yue
Yes, I did, thanks. In fact, I fiddled with several DIU/parallelCopies configurations, but no luckFabrizio Miano

2 Answers

1
votes

When there's a copy activity performance issue in ADF and the root cause is not obvious (e.g. if source is fast, but sink is throttled, and we know why) -- here's how I would go about it :

  1. Start with the Integration Runtime (IR) (doc.). This might be a jobs' concurrency issue, a network throughput issue, or just an undersized VM (in case of self-hosted). Like, >80% of all issues in my prod ETL are caused by IR-s, in one way or another.
  2. Replicate copy activity behavior both on source & sink. Query the views from your local machine (ideally, from a VM in the same environment as your IR), write the flat files to blob, etc. I'm assuming you've done that already, but having another observation rarely hurts.
  3. Test various configurations of copy activity. Changing isolationLevel, partitionOption, parallelCopies and enableStaging would be my first steps here. This won't fix the root cause of your issue, obviously, but can point a direction for you to dig in further.
  4. Try searching the documentation (this doc., provided by @Leon is a good start). This should have been a step #1, however, I find ADF documentation somewhat lacking.

N.B. this is based on my personal experience with Data Factory.
Providing a specific solution in this case is, indeed, quite hard.

1
votes

To whoever might stumble upon the same issue, I managed to find out, rather empirically, that the bottleneck was being caused by the presence of several key-hash computations in the view on SQL DB. In fact, once I removed these - calculated later on Azure Synapse Analytics (data warehouse) - I observed a massive performance boost of the copy operation.