2 votes

At the moment we're using an Airflow installation that we deployed ourselves on Kubernetes, but the idea is to migrate to Cloud Composer. We're using Airflow to run Dataflow jobs with a customized version of DataFlowJavaOperator (via a plugin), because we need to execute a Java application that isn't self-contained in a single jar. So we basically run a bash script that launches the command:

java -cp "jar_folder/*" MainClass

All of the jar dependencies are stored on a disk shared between all the workers, but this feature is missing in Composer, where we're forced to use Cloud Storage to share job binaries. The problem is that running the Java program from a directory on GCS mounted with gcsfuse is extremely slow.
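To make the setup concrete, the relevant task looks roughly like this (a simplified sketch using a plain BashOperator rather than our customized DataFlowJavaOperator; the DAG/task names, jar_folder and MainClass are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Simplified stand-in for our customized DataFlowJavaOperator: the essence
# is just launching the Java application with a directory of jars on the
# classpath. jar_folder currently lives on a disk shared by all workers.
with DAG(
    dag_id="dataflow_java_job",
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,
) as dag:
    run_job = BashOperator(
        task_id="run_dataflow_job",
        bash_command='java -cp "jar_folder/*" MainClass',
    )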

Do you have any suggestions on how to implement this scenario in Cloud Composer?

Thanks


1 Answer

1 vote

Composer automatically syncs content placed in gs://{your-bucket}/dags and gs://{your-bucket}/plugins to the local Pod file system. Only DAG and plugin source code is expected to be copied there, but nothing prevents you from storing other binaries there as well (though this isn't recommended, because you may exceed the disk capacity, at which point workflow execution would be affected by the lack of local space).

FYI, the local file system paths are /home/airflow/gcs/dags and /home/airflow/gcs/plugins, respectively.
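For example, if the jars are uploaded under gs://{your-bucket}/dags/jar_folder, a task can point its classpath at the synced local copy instead of a gcsfuse mount (a minimal sketch; the DAG/task names, jar_folder and MainClass are placeholders for your actual paths and entry point):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# Jars uploaded to gs://{your-bucket}/dags/jar_folder are synced by Composer
# to /home/airflow/gcs/dags/jar_folder on the worker, so the JVM reads them
# from the local file system rather than from GCS at startup.
with DAG(
    dag_id="dataflow_java_job",
    start_date=datetime(2018, 1, 1),
    schedule_interval=None,
) as dag:
    run_job = BashOperator(
        task_id="run_dataflow_job",
        bash_command='java -cp "/home/airflow/gcs/dags/jar_folder/*" MainClass',
    )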