From https://airflow.apache.org/scheduler.html :
Note that if you run a DAG on a schedule_interval of one day, the run stamped 2016-01-01 will be triggered soon after 2016-01-01T23:59. In other words, the job instance is started once the period it covers has ended.
This behavior is very painful.
For example, I have an ETL job that runs every day with schedule_interval `0 1 * * *`, so the run stamped 2019-09-22 01:00:00 is actually triggered at 2019-09-23 01:00:00. But my ETL processes all data before the run's actual start time, i.e. the data range is (history, 2019-09-23 00:00:00), and we can't use datetime.now() because that would make runs irreproducible. This forces me to add one day to execution_date:
etl_end_time = "{{ (execution_date + macros.timedelta(days=1)).strftime('%Y-%m-%d 00:00:00') }}"
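To make the effect of that template concrete, here is a plain-Python sketch (the variable names are mine, not Airflow's API) of what the Jinja expression evaluates to for the daily run stamped 2019-09-22 01:00:00:

```python
from datetime import datetime, timedelta

# The run stamped 2019-09-22 01:00:00 only starts at 2019-09-23 01:00:00.
execution_date = datetime(2019, 9, 22, 1, 0, 0)

# Equivalent of the Jinja template above: add one day, truncate to midnight.
etl_end_time = (execution_date + timedelta(days=1)).strftime('%Y-%m-%d 00:00:00')
print(etl_end_time)  # → 2019-09-23 00:00:00
```

So the +1 day shift makes the rendered cutoff match the data actually available when the run starts.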
However, when I need to run a job with schedule_interval `45 2,3,4,5,6 * * *`, the run stamped 2019-09-22 06:45:00 is triggered at 2019-09-23 02:45:00, i.e. at the next scheduled time, which is not one day after execution_date. Instead of adding a day, I had to change the schedule_interval to `45 2,3,4,5,6,7 * * *` and put a dummy operator on the last run of each day.
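The timing mismatch can be verified with a small sketch. `next_slot` below is a hypothetical helper of my own (not an Airflow function) that computes the next fire time for a cron of the form "45 H1,H2,... * * *" by brute force:

```python
from datetime import datetime, timedelta

def next_slot(after, hours, minute=45):
    """Next time strictly after `after` whose hour is in `hours` at `minute`."""
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    while not (t.hour in hours and t.minute == minute):
        t += timedelta(minutes=1)
    return t

exec_date = datetime(2019, 9, 22, 6, 45)
# With hours 2..6, the run stamped 06:45 is only triggered at the NEXT slot,
# 02:45 the following day -- not at execution_date + 1 day (06:45).
print(next_slot(exec_date, [2, 3, 4, 5, 6]))      # 2019-09-23 02:45:00
# Adding hour 7 makes the 06:45 run trigger the same day at 07:45,
# at the cost of one dummy final run per day.
print(next_slot(exec_date, [2, 3, 4, 5, 6, 7]))   # 2019-09-22 07:45:00
```

This is why the "+1 day" macro that works for the daily job gives the wrong cutoff here.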
And in this situation you don't need to add one day to execution_date, which means you have to define two different etl_end_time expressions to represent the same date in jobs with different schedule_intervals.
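Concretely, the two expressions end up like this (a sketch with my own variable names; the dates follow the examples above):

```python
from datetime import datetime, timedelta

fmt = '%Y-%m-%d 00:00:00'

# Daily job (0 1 * * *): the run stamped 2019-09-22 starts on 2019-09-23,
# so one day must be added to name the 2019-09-23 cutoff.
daily_execution_date = datetime(2019, 9, 22, 1, 0)
etl_end_time_daily = (daily_execution_date + timedelta(days=1)).strftime(fmt)

# Batch job with the dummy-run workaround (45 2,...,7 * * *): the run stamped
# 2019-09-23 02:45 already starts on 2019-09-23, so nothing is added.
batch_execution_date = datetime(2019, 9, 23, 2, 45)
etl_end_time_batch = batch_execution_date.strftime(fmt)

# Two different expressions needed for the same date:
print(etl_end_time_daily, etl_end_time_batch)  # both 2019-09-23 00:00:00
```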
All of this is very awkward. Is there any config or built-in method to make execution_date equal to the run's actual start time? Or do I have to modify the Airflow source code...