14
votes

I'm testing the use of Airflow, and after triggering a (seemingly) large number of DAGs at the same time, it seems to just fail to schedule anything and starts killing processes. These are the logs the scheduler prints:

[2019-08-29 11:17:13,542] {scheduler_job.py:214} WARNING - Killing PID 199809
[2019-08-29 11:17:13,544] {scheduler_job.py:214} WARNING - Killing PID 199809
[2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992
[2019-08-29 11:17:44,614] {scheduler_job.py:214} WARNING - Killing PID 2992
[2019-08-29 11:18:15,692] {scheduler_job.py:214} WARNING - Killing PID 5174
[2019-08-29 11:18:15,693] {scheduler_job.py:214} WARNING - Killing PID 5174
[2019-08-29 11:18:46,765] {scheduler_job.py:214} WARNING - Killing PID 22410
[2019-08-29 11:18:46,766] {scheduler_job.py:214} WARNING - Killing PID 22410
[2019-08-29 11:19:17,845] {scheduler_job.py:214} WARNING - Killing PID 42177
[2019-08-29 11:19:17,846] {scheduler_job.py:214} WARNING - Killing PID 42177
...

I'm using a LocalExecutor with a PostgreSQL backend DB. It seems to happen only when I trigger a large number (>100) of DAGs at about the same time via external triggering, as in:

airflow trigger_dag DAG_NAME

After waiting for it to finish killing whatever processes it is killing, it starts executing all of the tasks properly. I don't even know what those processes were, since I can't see them after they are killed...

Did anyone encounter this kind of behavior? Any idea why that would happen?

2
What's your concurrency setting for the dag? - Chengzhi
Do you mean the max active runs per DAG? The settings there are quite unclear as to what they affect, and the online docs aren't much clearer. Is there a specific setting I should look at? - GuD
Maybe it's easier if you can share the DAG file? The default is 16 concurrent tasks, but you can bump it up. github.com/apache/airflow/blob/master/airflow/models/… - Chengzhi
We seem to be experiencing a similar issue since upgrading to Airflow 1.10.5, but we haven't been able to get to the bottom of it. What version of Airflow are you running? - Louis Simoneau
@LouisSimoneau what version does not have the issue? - tooptoop4

2 Answers

5
votes

The reason for the above in my case was that I had a DAG file creating a very large number of DAGs dynamically.
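As a rough illustration, the pattern looked something like this (a hypothetical sketch; the loop bound, DAG ids, and operator are made up, not my actual file):

# Hypothetical sketch of a DAG file that registers many DAGs dynamically.
# The loop bound, DAG ids and task are made up for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

for i in range(200):
    dag_id = "generated_dag_{}".format(i)
    dag = DAG(
        dag_id=dag_id,
        schedule_interval=None,          # these DAGs are triggered externally
        start_date=datetime(2019, 8, 1),
        catchup=False,
    )
    DummyOperator(task_id="noop", dag=dag)
    # Airflow discovers DAGs by scanning module-level globals,
    # so each generated DAG has to be exposed here.
    globals()[dag_id] = dag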

The "dagbag_import_timeout" config variable which controls "How long before timing out a python file import while filling the DagBag" was set to the default value of 30. Thus the process filling the DagBag kept timing out.

4
votes

I've had a very similar issue. My DAG file was of the same nature (a file that generates many DAGs dynamically). I tried the suggested solution, but it didn't work: I already had that value set fairly high (60 seconds), and increasing it to 120 didn't resolve my issue.

Posting what worked for me in case someone else has a similar issue.

I came across this JIRA ticket: https://issues.apache.org/jira/browse/AIRFLOW-5506

which helped me resolve my issue: I disabled the SLA configuration, and then all my tasks started to run!
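In my case, "disabling the SLA configuration" simply meant removing the sla entry from the DAGs' default_args. A hypothetical sketch (the DAG id, dates and owner are made up; only the commented-out sla line is the relevant part):

# Hypothetical sketch; only the commented-out "sla" line matters here.
from datetime import datetime, timedelta  # timedelta only needed if sla is re-enabled

from airflow import DAG

default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 8, 1),
    # "sla": timedelta(hours=1),  # removing/commenting this disables SLA checks for these tasks
}

dag = DAG(
    dag_id="example_dag",
    default_args=default_args,
    schedule_interval=None,
)

Newer releases also have a check_slas option in the [core] section of airflow.cfg that turns SLA checking off globally, though I'm not sure from which version it is available.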

There can also be other solutions, as other comments in this ticket suggest.

For the record, my issue started to occur after I enabled lots of such DAGs (around 60?) that had been disabled for a few months. Not sure how the SLA affects this from a technical perspective, TBH, but it did.