4
votes

I'm currently working on a POC and primarily focusing on Dataflow for ETL processing. I have created the pipeline using Dataflow 2.1 Java Beam API, and it takes about 3-4 minutes just to initialise, and also it takes about 1-2 minutes for termination on each run. However, the actual transformation (ParDo) takes less than a minute. Moreover, I tried running the jobs by following different approaches,

  • Running the job on local machine
  • Running the job remotely on GCP
  • Running the job via Dataflow template

But it looks like, all the above methods consume more or less same time for initialization and termination. So this is being a bottleneck for the POC as we intend to run hundreds of jobs every day.

I'm looking for a way to share the initialisation/termination time across all jobs so that it can be a one-time activity or any other approaches to reduce the time.

Thanks in advance!

1

1 Answers

1
votes

From what I know, there are no ways to reduce startup or teardown time. You shouldn't consider that to be a bottleneck, as each run of a job is independent of the last one, so you can run them in parallel, etc. You could also consider converting this to a streaming pipeline if that's an option to eliminate those times entirely.