5
votes

I have a data factory with multiple pipelines, and each pipeline has around 20 copy activities that copy Azure tables between two storage accounts.

Each pipeline handles a snapshot of each Azure table, so I want to run the pipelines sequentially to avoid the risk of overwriting the latest data with old data.

I know we can achieve this by giving the first pipeline's output as input to the second pipeline. But as I have many activities in a pipeline, I am not sure which activity will complete last.

Is there any way I can know that a pipeline has completed, or any way one pipeline's completed status can trigger the next pipeline?

In an activity, inputs is an array. So is it possible to give multiple inputs? If yes, will all the inputs run asynchronously or one after the other?

In the context of multiple inputs I have read about scheduling dependency. So can an external input act as a scheduling dependency, or only an internal dataset?

3
This was almost one year ago; did you ever find an answer or workaround to this? – dim_user

3 Answers

4
votes

This is an old one, but I was still having this issue with Data Factory v2, so in case anyone has come here looking for the solution in Data Factory v2: the "Wait on completion" tick box is hidden under the 'Advanced' part of the Settings tab for the Execute Pipeline activity. Just check it to get the desired result.

Note that the 'Advanced' section of the Settings tab is not the same as the 'Advanced' free-coding tab. See screenshot:

[screenshot: Execute Pipeline activity, Settings tab, with the 'Wait on completion' checkbox ticked]
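The same setting can also be expressed in the pipeline JSON via the `waitOnCompletion` property of the Execute Pipeline activity. A minimal sketch (the activity and pipeline names here are made up):

```json
{
    "name": "RunCopyPipeline1",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "CopyPipeline1",
            "type": "PipelineReference"
        },
        "waitOnCompletion": true
    }
}
```

With `waitOnCompletion: true`, the parent pipeline blocks on this activity until the child pipeline finishes, so chaining several of these runs the child pipelines sequentially.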

2
votes

I think you currently have a couple of options for dealing with this. Neither is really ideal, but nothing in ADF is ideal in its current form! So...

Option 1

Enforce a time-slice delay or offset on the second pipeline's activities. A delay would be easier to change without re-provisioning slices and can be added to an activity. This wouldn't be event driven, but it would give you a little more control to avoid overlaps.

"policy": {
    "timeout": "1.00:00:00",
    "delay": "02:00:00",  // <<<< 2 hour delay
    "concurrency": 1
}

Check this page for more info on both attributes and where to use them: https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution
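For context, the policy block sits on each activity in the ADF v1 pipeline definition. A trimmed sketch of one of your copy activities with the delay applied (dataset and activity names are placeholders):

```json
{
    "name": "CopyTableSnapshot",
    "type": "Copy",
    "inputs": [ { "name": "SourceTableDataset" } ],
    "outputs": [ { "name": "SinkTableDataset" } ],
    "typeProperties": {
        "source": { "type": "AzureTableSource" },
        "sink": { "type": "AzureTableSink" }
    },
    "policy": {
        "timeout": "1.00:00:00",
        "delay": "02:00:00",
        "concurrency": 1
    }
}
```

You'd repeat the delay on each activity of the second pipeline, so every slice execution waits two hours past its scheduled time before starting.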

Option 2

Break out to PowerShell and use something at a higher level to control this.

For example, use Get-AzureRmDataFactoryActivityWindow to check the first pipeline's state. Then, if complete, use Set-AzureRmDataFactorySliceStatus to update the second pipeline's datasets to Ready.
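A rough sketch of that polling approach (the resource group, factory, pipeline, and dataset names are placeholders, and you'd tune the slice window and polling interval to your schedule):

```powershell
# Placeholder names - substitute your own resource group, factory, pipeline and dataset.
$rg      = "MyResourceGroup"
$factory = "MyDataFactory"

# Poll the first pipeline's activity windows until none are still in progress.
do {
    Start-Sleep -Seconds 60
    $windows = Get-AzureRmDataFactoryActivityWindow `
        -ResourceGroupName $rg -DataFactoryName $factory -PipelineName "Pipeline1"
    $running = $windows | Where-Object { $_.WindowState -ne "Ready" }
} while ($running)

# First pipeline done: mark the second pipeline's input slices as Ready so it can start.
Set-AzureRmDataFactorySliceStatus `
    -ResourceGroupName $rg -DataFactoryName $factory `
    -DatasetName "Pipeline2InputDataset" `
    -StartDateTime (Get-Date).Date -EndDateTime (Get-Date) `
    -Status "Ready"
```

You'd run this from a scheduled job (e.g. Azure Automation) rather than by hand, which is the "something at a higher level" part.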

OR

Do this at the pipeline level with Suspend-AzureRmDataFactoryPipeline

More info on ADF PowerShell cmdlets here: https://docs.microsoft.com/en-gb/powershell/module/azurerm.datafactories/Suspend-AzureRmDataFactoryPipeline?view=azurermps-4.0.0

As I say, neither option is ideal, and you've already mentioned dataset chaining in your question.

Hope this helps.

0
votes

A pipeline is completed after all the output datasets of that pipeline are in the Ready state (which happens when the pipeline finishes successfully).

Furthermore, a pipeline can have multiple datasets from multiple pipelines as inputs (and as outputs). In this case, a pipeline will start only after all the previous pipelines finish successfully. If you have several pipelines' datasets as inputs, they will run asynchronously, depending on their schedules.

External datasets (inputs) act as a scheduling dependency because they can have their own (possibly different) availability.
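For example, an external input dataset in ADF v1 is one marked with `"external": true`, meaning it is produced outside the data factory; ADF only waits on its availability schedule rather than on a producing pipeline. A sketch (table and linked-service names are made up):

```json
{
    "name": "ExternalSourceTable",
    "properties": {
        "type": "AzureTable",
        "linkedServiceName": "SourceStorageLinkedService",
        "typeProperties": { "tableName": "MyTable" },
        "external": true,
        "availability": { "frequency": "Hour", "interval": 1 }
    }
}
```

An internal dataset, by contrast, is the output of another activity, and that producing activity finishing is what makes the slice Ready; both kinds therefore participate in scheduling dependencies, just with different triggers.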
