4 votes

We have a long DAG (~60 tasks), and quite frequently we see a DAG run for this DAG in a failed state. When we look at the tasks in the DAG, they are all in a state of either success or null (i.e. not even queued yet). It appears that the DAG run has been marked failed prematurely.

Under what circumstances can this happen, and what should people do to protect against it?

If it's helpful for context, we're running Airflow with the Celery executor on version 1.9.0. If we set the state of the DAG run in question back to running, then all the tasks (and the DAG as a whole) complete successfully.
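For reference, this is roughly how we flip the run back when it happens. This is a minimal sketch against the Airflow 1.9 metadata DB; the dag_id is a placeholder, and normally we just do this from the "Browse > DAG Runs" page in the UI:

    from airflow import settings
    from airflow.models import DagRun
    from airflow.utils.state import State

    # Sketch: mark the most recent failed dag_run as running again so the
    # scheduler picks the remaining (null-state) tasks back up.
    session = settings.Session()
    run = (
        session.query(DagRun)
        .filter(DagRun.dag_id == "example_long_dag",   # placeholder dag_id
                DagRun.state == State.FAILED)
        .order_by(DagRun.execution_date.desc())
        .first()
    )
    if run:
        run.state = State.RUNNING
        session.commit()
    session.close()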

How long does this DAG usually run when this appears? Do you have any log information for this specific point in time? - tobi6
Do you have by any chance a timeout defined at the DAG level or for some tasks? - Antoine Augusti
No timeouts for the DAG, but some timeouts for the tasks. The whole DAG normally takes around an hour to run. I've got tons of logs, but I'm not sure exactly what to look for. I didn't know a DAG could fail without a task failing, so what I'd love to know is how that can occur (or at least what conditions would have to be true); that could point me to which logs to look at. When DAGs fail, do they give a reason, and if so, where would I find that recorded? - Slipstream

1 Answer

7 votes

The only way that a DAG run can fail without a task failing is through something not connected to any of the tasks. Besides manual intervention (check that nobody on the team is manually failing the DAG runs!), the only thing that fails a DAG run without regard to task states is the timeout checker.

This runs inside the scheduler, while it is considering whether it needs to schedule a new dag_run. If it finds another active run that has been running longer than the dagrun_timeout argument of the DAG, then that run gets killed. As far as I can see this isn't logged anywhere, so the best way to diagnose it is to compare the time the DAG run started with the time the last task finished and check whether the difference is roughly the length of dagrun_timeout.
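For reference, the timeout is just an argument on the DAG object and is separate from any per-task timeouts. Here's a minimal sketch (the dag id, schedule, and durations are made up for illustration) showing the DAG-level dagrun_timeout next to a task-level execution_timeout:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    # DAG-level timeout: an active dag_run older than this can be failed by
    # the scheduler's timeout check, regardless of individual task states.
    dag = DAG(
        dag_id="example_long_dag",              # placeholder
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
        dagrun_timeout=timedelta(hours=2),      # budget for the whole run
    )

    # Task-level timeout: only fails this one task if it runs too long.
    step = BashOperator(
        task_id="long_step",
        bash_command="sleep 300",
        execution_timeout=timedelta(minutes=30),
        dag=dag,
    )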

You can see the code in action here: https://github.com/apache/incubator-airflow/blob/e9f3fdc52cb53f3ac3e9721e5128d17d1c5c418c/airflow/jobs.py#L800
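If you want to confirm this without trawling the logs, a rough sketch along these lines (placeholder dag_id, querying the metadata DB directly) compares the run's start time with the last finished task:

    from airflow import settings
    from airflow.models import DagRun, TaskInstance

    session = settings.Session()
    # Most recent run of the DAG in question.
    run = (
        session.query(DagRun)
        .filter(DagRun.dag_id == "example_long_dag")    # placeholder dag_id
        .order_by(DagRun.execution_date.desc())
        .first()
    )
    # End time of the last task instance that actually finished in that run.
    last_end = (
        session.query(TaskInstance.end_date)
        .filter(TaskInstance.dag_id == run.dag_id,
                TaskInstance.execution_date == run.execution_date,
                TaskInstance.end_date.isnot(None))
        .order_by(TaskInstance.end_date.desc())
        .first()
    )
    print(run.state, run.start_date, last_end)
    # If last_end - run.start_date is roughly dagrun_timeout, the scheduler's
    # timeout check is the likely cause of the premature failure.
    session.close()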