I have an Airflow cluster made of 3 worker nodes, using the CeleryExecutor with RabbitMQ as the message broker. My DAGs usually consist of tasks that download files, unzip them, upload them to Hadoop, and so on. The tasks depend on each other through files on the local filesystem, so all tasks of a DAG must run on a single machine/node.

When Airflow schedules the tasks of a single DAG run on different nodes, I end up with errors, since a task cannot see the local files produced by its upstream tasks on another machine. I need all tasks within a DAG run to be scheduled on the same machine.
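Here is a stripped-down sketch of one of these DAGs; the DAG id, URL, and paths are placeholders, not my real pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id='ingest_example',
    start_date=datetime(2020, 1, 1),
    schedule_interval='@daily',
)

# Each step reads what the previous step wrote to the local filesystem,
# which is why all three tasks must land on the same worker node.
download = BashOperator(
    task_id='download',
    bash_command='wget -q -O /tmp/data.zip https://example.com/data.zip',
    dag=dag,
)
unzip = BashOperator(
    task_id='unzip',
    bash_command='unzip -o /tmp/data.zip -d /tmp/data',
    dag=dag,
)
upload = BashOperator(
    task_id='upload_to_hdfs',
    bash_command='hdfs dfs -put -f /tmp/data /data/incoming/',
    dag=dag,
)

download >> unzip >> upload
```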
I tried setting dag_concurrency = 1 and max_active_runs_per_dag = 1, both in airflow.cfg and when initializing the DAG itself with DAG(concurrency=1, max_active_runs=1), with no success.
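Concretely, the DAG-level attempt looked like this (the DAG id and dates are again placeholders):

```python
from datetime import datetime

from airflow import DAG

dag = DAG(
    dag_id='ingest_example',
    start_date=datetime(2020, 1, 1),
    schedule_interval='@daily',
    concurrency=1,       # per-DAG counterpart of dag_concurrency
    max_active_runs=1,   # per-DAG counterpart of max_active_runs_per_dag
)
```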
The relevant settings from my airflow.cfg:
parallelism = 32
dag_concurrency = 1
worker_concurrency = 16
max_active_runs_per_dag = 16
As far as I understand, setting dag_concurrency to 1 should do the trick, but what am I missing here?