As the title implies, I'm looking to understand the difference between setting catchup=False in the DAG definition and using the LatestOnlyOperator.

https://airflow.apache.org/docs/stable/scheduler.html https://airflow.apache.org/docs/stable/_modules/airflow/operators/latest_only_operator.html

1 Answer

Well, they are, I would say, totally different concepts, and they can be used independently. It is true that both can be used to prevent backfilling, but if that's your only concern, then just use catchup=False. Quoting from this reply by one of the Airflow developers, it seems clear that this is the recommended practice:

As the author of LatestOnlyOperator, the goal was as a stopgap until catchup=False landed.

But he then goes on to say that LatestOnlyOperator should be deprecated. I don't agree (as a user of both catchup=False and LatestOnlyOperator), and I'll try to explain why. My understanding of these two concepts is this:


Catchup = True

In a DAG definition you can set the catchup flag to True. Note that catchup is an argument of the DAG itself, not an entry in default_args (those are passed on to the DAG's tasks). If you set this flag to True and turn the DAG ON, the scheduler will create DAG runs for each completed schedule interval from the start_date to the "present" and will execute them sequentially. Quoting the documentation:

If the dag.catchup value had been True instead, the scheduler would have created a DAG Run for each completed interval between 2015-12-01 and 2016-01-02 (but not yet one for 2016-01-02, as that interval hasn’t completed) and the scheduler will execute them sequentially.
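To make the quoted example concrete, here is a minimal sketch in plain Python (no Airflow involved) that models which intervals would get a DAG run. It assumes a daily schedule, a start_date of 2015-12-01, and a scheduler that first sees the DAG on 2016-01-02 at 06:00; the function names are hypothetical and this is only a simplified model of the behaviour described above, not the scheduler's actual code.

```python
from datetime import datetime, timedelta


def completed_intervals(start_date, now, interval):
    """Return (start, end) pairs for every schedule interval that has fully elapsed."""
    intervals = []
    start = start_date
    while start + interval <= now:
        intervals.append((start, start + interval))
        start += interval
    return intervals


def runs_to_create(start_date, now, interval, catchup):
    """Simplified model: all completed intervals with catchup, only the latest without."""
    intervals = completed_intervals(start_date, now, interval)
    return intervals if catchup else intervals[-1:]


start = datetime(2015, 12, 1)
now = datetime(2016, 1, 2, 6, 0)  # assumed moment the scheduler picks up the DAG
daily = timedelta(days=1)

with_catchup = runs_to_create(start, now, daily, catchup=True)
without_catchup = runs_to_create(start, now, daily, catchup=False)

print(len(with_catchup), "runs with catchup=True")
print(len(without_catchup), "run(s) with catchup=False")
```

Under these assumptions, catchup=True yields one run per day from 2015-12-01 through 2016-01-01, while catchup=False yields only the run for the most recently completed interval.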


LatestOnlyOperator

A LatestOnlyOperator is an extension of the BaseOperator. When the DAG run is not in the latest schedule interval (i.e. it is not the "last run"), a task made with this operator causes its downstream tasks to be skipped. Quoting from the LatestOnlyOperator docstring:

"""
Allows a workflow to skip tasks that are not running during the most
recent schedule interval.

If the task is run outside of the latest schedule interval, all
directly downstream tasks will be skipped.

Note that downstream tasks are never skipped if the given DAG_Run is
marked as externally triggered.
"""
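The check described in the docstring can be sketched in plain Python as follows. This is a simplified model, not the operator's source: it assumes a fixed-length schedule interval and that "latest" means the current time falls in the window right after the run's interval ends. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def should_skip_downstream(execution_date, interval, now, external_trigger=False):
    """Return True if downstream tasks would be skipped in this simplified model."""
    # Externally triggered runs never skip downstream tasks.
    if external_trigger:
        return False
    # The run for `execution_date` is "latest" while `now` lies in the
    # window that starts when its interval ends and lasts one interval.
    left_window = execution_date + interval
    right_window = left_window + interval
    return not (left_window < now <= right_window)


daily = timedelta(days=1)
now = datetime(2016, 1, 2, 6, 0)

latest_run = should_skip_downstream(datetime(2016, 1, 1), daily, now)
old_run = should_skip_downstream(datetime(2015, 12, 15), daily, now)
manual_run = should_skip_downstream(datetime(2015, 12, 15), daily, now, external_trigger=True)

print(latest_run, old_run, manual_run)
```

In this model only the old scheduled run skips its downstream tasks; the latest run and the externally triggered run both let them execute.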

Conclusion

You can define your scheduled DAG with catchup=True and use LatestOnlyOperator to make sure that some tasks are not executed during the catchup runs. Moreover, LatestOnlyOperator is useful if you want to re-run some past DAG runs (for example by clearing them in the UI) but have some tasks (such as sending notifications) that you want to skip during those re-runs.