1 vote

From https://airflow.apache.org/scheduler.html :

Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.

This behavior is very painful.

For example, I have an ETL job that runs every day with schedule_interval 0 1 * * *, so the run stamped 2019-09-22 01:00:00 is triggered at 2019-09-23 01:00:00. But my ETL processes all data collected before the run actually starts, meaning the data range is (history, 2019-09-23 00:00:00), and we can't use datetime.now() because that would not be reproducible. This forces me to add one day to execution_date:

etl_end_time = "{{ (execution_date + macros.timedelta(days=1)).strftime('%Y-%m-%d 00:00:00') }}"
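
For context, a minimal sketch of how I wire this templated value into the daily DAG (the DAG id, task id, and the echo command are illustrative placeholders, not my real job):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# End of the data range: execution_date plus one day, formatted as midnight.
etl_end_time = "{{ (execution_date + macros.timedelta(days=1)).strftime('%Y-%m-%d 00:00:00') }}"

with DAG(
    dag_id="daily_etl",                  # illustrative name
    schedule_interval="0 1 * * *",
    start_date=datetime(2019, 9, 1),
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        # Jinja renders the template at run time: the run stamped 2019-09-22
        # (triggered on 2019-09-23 01:00) gets etl_end_time = 2019-09-23 00:00:00.
        bash_command="echo 'loading data up to " + etl_end_time + "'",
    )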

However, when I need to run a job with schedule_interval 45 2,3,4,5,6 * * *, the run stamped 2019-09-22 06:45:00 is not triggered until 2019-09-23 02:45:00, the next execution time in the cron schedule, which falls on the following day. Instead of adding a day, I had to change the schedule_interval to 45 2,3,4,5,6,7 * * * and put a dummy operator on the last run of each day. In this situation you do not add one day to execution_date, which means you have to define two different etl_end_time expressions to represent the same date in jobs with different schedule_intervals.
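
To make the mismatch concrete, these are roughly the two expressions I end up maintaining (the variable names, and the exact formula of the second one, are just how I write it for my jobs):

# Daily job (0 1 * * *): the run stamped 2019-09-22 starts on 2019-09-23,
# so one day has to be added to execution_date to reach 2019-09-23 00:00:00.
etl_end_time_daily = "{{ (execution_date + macros.timedelta(days=1)).strftime('%Y-%m-%d 00:00:00') }}"

# Extended cron job (45 2,3,4,5,6,7 * * * with a dummy final run): each real
# run starts on the same calendar day as its execution_date, so no day is added.
etl_end_time_cron = "{{ execution_date.strftime('%Y-%m-%d 00:00:00') }}"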

All of this is very awkward for me. Is there any config option or built-in method to make execution_date equal to the time the run actually starts? Or do I have to modify the Airflow source code...


2 Answers

1 vote

For a scheduled run, next_execution_date returns the exact time at which the run is triggered.
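
For example, the daily job from the question can drop the manual timedelta entirely (the variable name mirrors the question; next_execution_date is a standard template variable in the task context):

# For the 0 1 * * * DAG, the run stamped 2019-09-22 01:00:00 has
# next_execution_date = 2019-09-23 01:00:00, the moment it is actually
# triggered, so no manual one-day offset is needed.
etl_end_time = "{{ next_execution_date.strftime('%Y-%m-%d 00:00:00') }}"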

0 votes

I found there is a PR: https://github.com/apache/airflow/pull/5787

This change introduces the attribute schedule_interval_edge, a string containing either 'start' or 'end', to DAGs. The scheduler uses the value to determine if a DAG should be scheduled at the start or the end of the schedule interval.

A parameter with the same name was also added to the default_airflow.cfg in the [scheduler] section.

I have taken the code from this PR.
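
Assuming you are running that patched code, usage would look roughly like this (only a sketch based on the PR description above; the exact way the attribute is passed may differ):

from datetime import datetime

from airflow import DAG

# With schedule_interval_edge='start', the run stamped 2019-09-22 01:00:00
# should trigger at 2019-09-22 01:00:00 instead of one interval later.
dag = DAG(
    dag_id="daily_etl",                  # illustrative name
    schedule_interval="0 1 * * *",
    start_date=datetime(2019, 9, 1),
    schedule_interval_edge="start",      # attribute introduced by the PR
)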