6
votes

I'm setting up a pipeline in an Azure "Data Factory", for the purpose of taking flat files from storage and loading them into tables within an Azure SQL DB.

The template for this pipeline specifies that I need a start and end time, which the tutorial says to set to 1 day.

I'm trying to understand this. If it were a cron job in Linux or a scheduled task in Windows Server, I'd simply tell it when to start (e.g., daily at 6am) and it would take however long it takes to complete.

This leads me to several related questions:

  • Why would I need to specify an end time?
  • What if I don't know how long it will take to run?
  • If I set it too far in the future, do I run the risk of the data pipeline not completing in a timely manner?
  • If I set it too soon, will the pipeline break?
  • Why is it hard-coded as a date instead of a frequency? (e.g., it says to use this format: "2014-10-14T16:32:41Z")

I found a prior question that sheds a little light on how to use a frequency instead of hard-coded dates, but the solution there does not answer my questions above.


2 Answers

4
votes

The 1-day schedule is just an example. It illustrates that, with the frequency set to hourly and a pipeline period of 1 day, you should expect 24 activity windows.
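To make the arithmetic concrete, here is a minimal Python sketch that counts the windows an hourly frequency produces over a 1-day period. It only models the concept; `activity_windows` is a hypothetical helper written for this illustration, not part of any Data Factory SDK, and the start date is an arbitrary assumption.

```python
from datetime import datetime, timedelta, timezone


def activity_windows(start, end, frequency):
    """Return (window_start, window_end) pairs that fit between start and end.

    Hypothetical helper for illustration only; it is not an ADF API.
    """
    windows = []
    window_start = start
    # Keep adding windows while a full window still fits before the end.
    while window_start + frequency <= end:
        windows.append((window_start, window_start + frequency))
        window_start += frequency
    return windows


# Assumed example period: exactly one day, expressed in UTC.
start = datetime(2014, 10, 14, 0, 0, tzinfo=timezone.utc)
end = start + timedelta(days=1)

hourly = activity_windows(start, end, timedelta(hours=1))
daily = activity_windows(start, end, timedelta(days=1))

print(len(hourly))  # 24
print(len(daily))   # 1
```

Changing only the frequency changes the number of windows; the start and end dates stay the same.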

Why would I need to specify an end time?

You do not have to specify an end time; if you want, you can have the pipeline run indefinitely. However, you might have business reasons to set an end time, such as coinciding with a yearly business cycle. The overall pipeline start and end times apply to the collection of activities within it. Activities run according to the frequency you set for the activity (hourly, daily, etc.) and the availability of datasets. You can also set the start time for activities, offset or delay them (for example, if you want to process yesterday's data today), or set a start date in the past to backfill data.

Why is it hard coded as a date instead of a frequency?

The pipeline start and end are dates rather than a frequency because together they define the overall date interval during which your pipeline is active. The individual processing activities are what deal with frequency, that is, how often and at what time they run.
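As a rough illustration of that distinction, the Python sketch below parses timestamps in the format quoted in the question and treats the pipeline start and end as the bounds of an active period. The end value, the `parse_utc` and `is_active` helpers, and the choice to treat the end as exclusive are all assumptions made for this example, not documented ADF behavior.

```python
from datetime import datetime, timedelta, timezone

# Format of the timestamp quoted in the question, e.g. "2014-10-14T16:32:41Z".
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_utc(text):
    """Parse a 'Z'-suffixed timestamp into a timezone-aware UTC datetime."""
    return datetime.strptime(text, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def is_active(moment, start, end):
    """Hypothetical check: is the moment inside the pipeline's active period?

    Treating the end as exclusive is an assumption of this sketch.
    """
    return start <= moment < end


pipeline_start = parse_utc("2014-10-14T16:32:41Z")
pipeline_end = parse_utc("2014-10-15T16:32:41Z")  # assumed: one day later

# The active period is a fixed interval on the calendar, not a recurrence.
active_period = pipeline_end - pipeline_start
print(active_period == timedelta(days=1))

print(is_active(parse_utc("2014-10-14T20:00:00Z"), pipeline_start, pipeline_end))
print(is_active(parse_utc("2014-10-16T00:00:00Z"), pipeline_start, pipeline_end))
```

How often anything runs inside that period is a separate setting on each activity.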

What if I don't know how long it will take to run?

Once the activities kick off, they will run to completion. If they run past the end date, the pipeline will simply not kick off any new activities.

If I set it too far in the future, do I run the risk of the data pipeline not completing in a timely manner?

No, completing in a timely manner only has to do with your cluster size, data volume, and concurrency setting.

If I set it too soon, will the pipeline break?

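No. As noted above, activities that have already started run to completion; once the end date has passed, the pipeline simply stops kicking off new ones.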

We provide this level of scheduling complexity so that you have much more flexibility in orchestrating multiple services while letting ADF manage cloud resources, rather than just kicking off a cron job. There is a lot more nuanced information about scheduling in our documentation: https://azure.microsoft.com/en-us/documentation/articles/data-factory-scheduling-and-execution/

0
votes

Why would I need to specify an end time?

In ADF1, if you specify a start time, you must also specify an end time. If you specify neither a start nor an end time, that's fine: you will be able to deploy the pipeline, but the activities in the pipeline will not trigger.