I'm currently working on a POC and primarily focusing on Dataflow for ETL processing. I have created the pipeline using Dataflow 2.1 Java Beam API, and it takes about 3-4 minutes just to initialise, and also it takes about 1-2 minutes for termination on each run. However, the actual transformation (ParDo) takes less than a minute. Moreover, I tried running the jobs by following different approaches,
- Running the job on local machine
- Running the job remotely on GCP
- Running the job via Dataflow template
But it looks like, all the above methods consume more or less same time for initialization and termination. So this is being a bottleneck for the POC as we intend to run hundreds of jobs every day.
I'm looking for a way to share the initialisation/termination time across all jobs so that it can be a one-time activity or any other approaches to reduce the time.
Thanks in advance!