Problem
Airflow tasks of the type DataflowTemplateOperator take a long time to complete, since they wait for the Dataflow job to finish. This means they can block other tasks (correct?).
The more of these tasks we run, the bigger the Cloud Composer cluster we need (in our case), just to execute tasks that are essentially blocking while they shouldn't be: they should be async operations.
Options
- Option 1: just launch the Dataflow job and mark the Airflow task as successful right away
- Option 2: write a wrapper as explained here and use reschedule mode as explained here
Option 1 does not seem feasible, as the only relevant option the DataflowTemplateOperator exposes is poll_sleep, the wait time between completion checks (source).
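For reference, this is roughly how such a task looks today (bucket, template path, job name, and parameters below are placeholders, not our actual values); poll_sleep only controls how often the operator checks for completion, not whether it waits at all:

```python
from airflow.providers.google.cloud.operators.dataflow import DataflowTemplateOperator

# Placeholder template and parameters. poll_sleep only tunes the polling interval;
# the task still holds its worker slot until the Dataflow job reaches a terminal state.
run_template = DataflowTemplateOperator(
    task_id="run_dataflow_template",
    template="gs://my-bucket/templates/my-template",
    job_name="my-dataflow-job",
    parameters={"input": "gs://my-bucket/input/*"},
    poll_sleep=60,  # seconds between completion checks
)
```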
The DataflowCreateJavaJobOperator has a check_if_running option, but that waits for the completion of a previous job with the same name (see this code).
It seems that after launching a job, wait_for_finish is executed (see this line), which boils down to polling while the job is still "incomplete" (see this line).
For Option 2, I need Option 1: the wrapper would still have to launch the job without waiting for it.
Questions
- Am I correct in assuming that Dataflow tasks will block other tasks in Cloud Composer/Airflow?
- Is there a way to schedule a job without a "wait to finish" using the built-in operators? (I might have overlooked something.)
- Is there an easy way to write this myself? I'm thinking of a bash launch script, followed by a task that checks whether the job finished correctly, running in reschedule mode (a rough sketch of what I have in mind is below).
- Is there another way to avoid blocking other tasks while Dataflow jobs are running? This is basically an async operation and should not hold on to worker resources.
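To make the third question concrete, below is a rough, untested sketch of what I have in mind, assuming Airflow 2.x imports and the gcloud CLI being available on the workers; project, region, template path, and job name are placeholders. The launch task submits a classic template with gcloud and returns immediately, and a small custom sensor polls the job state until it is done:

```python
import subprocess

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.base import BaseSensorOperator
from airflow.utils.dates import days_ago

# All placeholders, not values from our actual setup.
PROJECT = "my-project"
REGION = "europe-west1"
TEMPLATE = "gs://my-bucket/templates/my-template"


class DataflowJobDoneSensor(BaseSensorOperator):
    """Polls a Dataflow job via gcloud and succeeds once it reaches JOB_STATE_DONE."""

    def __init__(self, *, project, region, launch_task_id, **kwargs):
        super().__init__(**kwargs)
        self.project = project
        self.region = region
        self.launch_task_id = launch_task_id

    def poke(self, context):
        # The launch task pushes its last line of stdout (the job id) to XCom.
        job_id = context["ti"].xcom_pull(task_ids=self.launch_task_id)
        state = subprocess.check_output(
            [
                "gcloud", "dataflow", "jobs", "describe", job_id,
                "--project", self.project,
                "--region", self.region,
                "--format", "value(currentState)",
            ],
            text=True,
        ).strip()
        self.log.info("Dataflow job %s is in state %s", job_id, state)
        if state in ("JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
            raise RuntimeError(f"Dataflow job {job_id} ended in state {state}")
        return state == "JOB_STATE_DONE"


with DAG(
    dag_id="dataflow_fire_and_forget",
    start_date=days_ago(1),
    schedule_interval=None,
) as dag:
    # Submit the template and return immediately; --format makes gcloud print
    # only the job id, which BashOperator pushes to XCom as its last output line.
    launch = BashOperator(
        task_id="launch_dataflow_job",
        bash_command=(
            "gcloud dataflow jobs run my-dataflow-job-{{ ds_nodash }} "
            f"--gcs-location {TEMPLATE} --project {PROJECT} --region {REGION} "
            "--format='value(id)'"
        ),
    )

    # Reschedule mode releases the worker slot between pokes instead of holding it.
    wait = DataflowJobDoneSensor(
        task_id="wait_for_dataflow_job",
        project=PROJECT,
        region=REGION,
        launch_task_id="launch_dataflow_job",
        mode="reschedule",
        poke_interval=5 * 60,
        timeout=6 * 60 * 60,
    )

    launch >> wait
```

With mode="reschedule" the sensor frees its worker slot between pokes, so the only cost while the Dataflow job runs is one short check every poke_interval seconds.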