In our Airflow installation we use UTC, but we have some daily jobs whose schedules must change for daylight saving time. When we exit daylight saving time, we have to move the schedule forward one hour.
Unfortunately, this also means that these jobs execute immediately: the scheduler sees that the job has not run in the last 24 hours and concludes that it must be time for another run.
I am aware that we could set the DAG start date to prevent this initial run. Is there any other way to accomplish changing the schedule but waiting for the next interval to run the job?
We have a similar issue with creating monthly or weekly jobs. Is the DAG start date the right way to handle these?
Further, if it is, how should the start date be set?
For example, suppose a job is scheduled with '0 4 * * *' and I change it to '0 5 * * *'. If I set the start date to 11/5/2020, will it execute at 5am on 11/5, or will it wait for the first complete execution interval after the start date and run at 5am on 11/6?
1 Answer
According to Airflow's official Confluence space, changing the schedule interval of an existing DAG is not recommended; you should create a new dag_id instead:
When needing to change your start_date and schedule interval, change the name of the dag (a.k.a. dag_id) - I follow the convention : my_dag_v1, my_dag_v2, my_dag_v3, my_dag_v4, etc...
Changing schedule interval always requires changing the dag_id, because previously run TaskInstances will not align with the new schedule interval.
Changing start_date without changing schedule_interval is safe, but changing to an earlier start_date will not create any new DagRuns for the time between the new start_date and the old one, so tasks will not automatically backfill to the new dates. If you manually create DagRuns, tasks will be scheduled, as long as the DagRun date is after both the task start_date and the dag start_date.
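For the start_date example in the question, my understanding is that Airflow triggers a cron-scheduled run only once the interval it covers has ended, so the first run after a start date of 11/5/2020 would fire at 5am on 11/6, stamped with an execution date of 11/5 at 5am. The following is a minimal plain-Python sketch of that rule, not Airflow code; first_daily_run is a hypothetical helper written only for this illustration, and it handles only daily schedules that fire on the hour.

```python
from datetime import datetime, timedelta


def first_daily_run(start_date, hour):
    """Return (execution_date, trigger_time) for the first run of a daily
    schedule firing at `hour`:00, assuming a run is triggered only when
    the interval it covers has ended (hypothetical helper)."""
    # First schedule tick at or after the start date.
    first_tick = start_date.replace(hour=hour, minute=0, second=0, microsecond=0)
    if first_tick < start_date:
        first_tick += timedelta(days=1)
    # The run covering [first_tick, first_tick + 1 day) fires at the end.
    return first_tick, first_tick + timedelta(days=1)


# Schedule '0 5 * * *' with a start date of 2020-11-05 (midnight).
execution_date, trigger_time = first_daily_run(datetime(2020, 11, 5), hour=5)
print(execution_date, trigger_time)
```

Under this assumption nothing runs on 11/5 itself, which is the "wait for the next interval" behaviour the question asks for.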
If you always want to schedule your DAG according to a local time, you can attach a time zone to the start date through tzinfo. The following DAG's cron schedule is interpreted in that time zone, so it would always fire at 03:40 local time, regardless of summer and winter time.
from datetime import datetime, timedelta

from airflow import DAG
from pendulum import timezone

default_args = {
    'depends_on_past': False,
    'wait_for_downstream': False,
    # The tz-aware start date makes the cron schedule follow Berlin local time
    'start_date': datetime(2020, 7, 16, tzinfo=timezone('Europe/Berlin')),
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'sla': timedelta(hours=1)
}

# Set Schedule
SCHEDULE_INTERVAL = '40 3 * * *'

# Define DAG
dag_audit_query_logs = DAG('local_tz_dag', default_args=default_args,
                           catchup=False,
                           max_active_runs=3,
                           schedule_interval=SCHEDULE_INTERVAL)
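To see why a time-zone-aware start date removes the need to edit the schedule twice a year, the sketch below converts the same local wall-clock time (03:40, matching the cron expression '40 3 * * *' above) to UTC on a summer date and a winter date. It uses only the standard library (zoneinfo, Python 3.9+, with system time zone data assumed to be available) rather than pendulum; local_run_in_utc is a hypothetical helper for illustration and not part of Airflow.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # standard library in Python 3.9+

BERLIN = ZoneInfo("Europe/Berlin")


def local_run_in_utc(year, month, day, hour=3, minute=40):
    """Convert a Berlin wall-clock time to UTC (hypothetical helper)."""
    local = datetime(year, month, day, hour, minute, tzinfo=BERLIN)
    return local.astimezone(timezone.utc)


summer = local_run_in_utc(2020, 7, 16)   # CEST, UTC+2
winter = local_run_in_utc(2020, 12, 16)  # CET, UTC+1
print(summer.strftime("%H:%M"), winter.strftime("%H:%M"))  # 01:40 02:40
```

The local time stays fixed while the UTC time shifts by one hour, which is the adjustment that otherwise has to be made by hand in a UTC-only cron expression.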