0 votes

I created a beam pipeline that I am running on dataflow. The pipeline contains 4 steps:

  1. read file contents
  2. convert file contents to json
  3. transform the json entries
  4. save transformed json entries into GCS

The problem is that steps 3 and 4 are blocked waiting for steps 1 and 2 to finish reading all the files. Is there an explanation for why the later steps don't simply process each file's data as it flows through?
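For reference, here is a simplified sketch of a pipeline with these four steps (the file pattern, output path, and `transform_entry` are placeholder names, not the actual code):

```python
import json
import apache_beam as beam

def transform_entry(entry):
    """Placeholder for the per-entry transformation (step 3)."""
    return entry

with beam.Pipeline() as p:
    (
        p
        | "ReadFiles" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")  # step 1
        | "ParseJson" >> beam.Map(json.loads)                                 # step 2
        | "Transform" >> beam.Map(transform_entry)                            # step 3
        | "Serialize" >> beam.Map(json.dumps)
        | "WriteToGCS" >> beam.io.WriteToText("gs://my-bucket/output/part")   # step 4
    )
```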


1 Answer

0 votes

Batch Dataflow pipelines run in stages, where each stage waits for its inputs before starting. See https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization for information on how Dataflow divides up pipelines into stages.
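As a sketch of where such a stage boundary can come from: a shuffle step such as `beam.Reshuffle()` ends the current fused stage, and in batch mode the stage after it does not start until the shuffle has received all of its input. The pipeline below is illustrative only (names and paths are placeholders), not the asker's actual code:

```python
import json
import apache_beam as beam

with beam.Pipeline() as p:
    (
        p
        | "ReadFiles" >> beam.io.ReadFromText("gs://my-bucket/input/*.json")
        | "ParseJson" >> beam.Map(json.loads)
        # Reshuffle is a shuffle: it breaks fusion here, so in a batch run
        # everything downstream waits until the shuffle has all its input.
        | "StageBoundary" >> beam.Reshuffle()
        | "Transform" >> beam.Map(lambda entry: entry)  # placeholder transform
        | "Serialize" >> beam.Map(json.dumps)
        | "WriteToGCS" >> beam.io.WriteToText("gs://my-bucket/output/part")
    )
```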