
I have files that accumulate in Blob Storage on Azure and are moved each hour to ADLS with Data Factory. There are around 1,000 files per hour, at 10 to 60 KB per file.

what is the best combination of:

"parallelCopies": ?
"cloudDataMovementUnits": ?

and also,

"concurrency": ?

to use?

Currently I have all of these set to 10, and each hourly slice takes around 5 minutes, which seems slow.

Could ADLS or Blob be getting throttled? How can I tell?


1 Answer


There isn't a one-size-fits-all solution when it comes to optimizing a copy activity. However, there are a few things you can check to find a balance. A lot of it depends on the pricing tiers, the type of data being copied, and the types of source and sink.

I am pretty sure you have already come across this article:

https://docs.microsoft.com/en-us/azure/data-factory/copy-activity-performance

It has a reference performance sheet; the actual values will definitely differ depending on the pricing tiers of your source and destination.


Parallel Copy:

  • This happens at the file level, so it is beneficial if your source files are big, as the data gets chunked (from the article).
  • Copy data between file-based stores: between 1 and 32. Depends on the size of the files and the number of cloud data movement units (DMUs) used to copy data between two cloud data stores, or the physical configuration of the Self-hosted Integration Runtime machine.
  • The default value is 4.
  • The copy behavior is important: if it is set to mergeFiles, parallel copy is not used (see the sketch after this list for where the setting goes).
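To make this concrete, here is a sketch of where these settings sit in a Data Factory (v1) copy activity definition. The activity name and the values are placeholders I made up for illustration, not a recommendation:

    {
        "name": "HourlyBlobToAdlsCopy",
        "type": "Copy",
        "typeProperties": {
            "source": { "type": "BlobSource" },
            "sink": {
                "type": "AzureDataLakeStoreSink",
                "copyBehavior": "PreserveHierarchy"
            },
            "parallelCopies": 16,
            "cloudDataMovementUnits": 8
        }
    }

With lots of tiny files like yours, per-file overhead tends to dominate, so parallelCopies is usually the more interesting knob; DMUs pay off mostly on large files.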

Concurrency:

  • This is simply how many instances of the same activity (one slice each) can run in parallel.
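Note that concurrency lives in the activity's policy section rather than in typeProperties; again a sketch with made-up values:

    "policy": {
        "concurrency": 10,
        "timeout": "01:00:00",
        "retry": 1
    }

In v1 this caps how many slices are processed in parallel, so it only buys you anything when multiple slices are pending (e.g. during a backfill); it won't speed up a single hourly slice.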

Other considerations:

Compression:

  • Codec
  • Level

Bottom line: you can pick and choose the compression codec and level. Faster compression will increase network traffic, while slower compression will increase the time consumed.
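If you want to experiment with this, compression is configured on the dataset rather than on the copy activity. A minimal sketch, with a made-up folder path:

    "typeProperties": {
        "folderPath": "mycontainer/hourly/",
        "format": { "type": "TextFormat" },
        "compression": {
            "type": "GZip",
            "level": "Fastest"
        }
    }

"Fastest" spends less CPU time compressing but ships more bytes over the network; "Optimal" is the reverse trade-off.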

Region:

The region of the data factory, source, and destination can affect performance and, especially, the cost of the operation. Having them all in the same region might not always be feasible depending on your business requirements, but it is definitely something you can explore.

Specific to Blobs:

https://docs.microsoft.com/en-us/azure/storage/common/storage-performance-checklist#blobs

This article gives you a good number of pointers to improve performance; however, when using Data Factory I don't think there is much you can do at that level. You can use the storage account's monitoring to check the throughput (and whether you are being throttled) while your copy is going on.