0 votes

We are using Dataflow to read from a set of Pub/Sub topics and write the data to BigQuery. We currently run one Dataflow job per topic, each writing to its corresponding BigQuery table. Is it possible to do this with a single Dataflow job?

I see documentation about combining multiple sources into one output here: https://cloud.google.com/dataflow/pipelines/design-principles?hl=en#multiple-sources

Is there anything keeping me from simply running multiple "basic" pipelines in the same Dataflow job, as in the basic flow described here: https://cloud.google.com/dataflow/pipelines/design-principles?hl=en#a-basic-pipeline

The documentation and my understanding of the code imply this can be done, but I'd like to be sure before I embark on the effort.
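For concreteness, here is a minimal sketch of what I have in mind, assuming the Apache Beam Java SDK. The topic names, table names, the single "payload" column, and the MessageToRowFn helper are placeholders I made up, and I assume the target tables already exist:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class MultiTopicPipeline {

      // Placeholder parse step: wraps each raw Pub/Sub message in a one-column TableRow.
      static class MessageToRowFn extends DoFn<String, TableRow> {
        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(new TableRow().set("payload", c.element()));
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Hypothetical topic -> table pairs; each pair becomes an independent
        // branch of the same pipeline graph, so a single job serves all of them.
        String[][] topicToTable = {
            {"projects/my-project/topics/topic-a", "my-project:my_dataset.table_a"},
            {"projects/my-project/topics/topic-b", "my-project:my_dataset.table_b"},
        };

        for (int i = 0; i < topicToTable.length; i++) {
          p.apply("ReadTopic" + i, PubsubIO.readStrings().fromTopic(topicToTable[i][0]))
           .apply("ParseTopic" + i, ParDo.of(new MessageToRowFn()))
           .apply("WriteTable" + i, BigQueryIO.writeTableRows()
               .to(topicToTable[i][1])
               .withCreateDisposition(CreateDisposition.CREATE_NEVER)  // tables assumed to exist
               .withWriteDisposition(WriteDisposition.WRITE_APPEND));
        }

        p.run();
      }
    }

Each loop iteration adds a disconnected read-transform-write branch, so it is still a single job submitted once.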


1 Answer

1 vote

My understanding is that there is nothing "wrong" with doing that, and it can be done; it just depends on what you are trying to achieve and which design considerations matter to you. For example, if you expect certain topics to have higher throughput, one possible benefit of splitting them into separate jobs is that each can scale up independently to handle its own topic.

In my case I am reading from multiple topics, applying a set of transforms to each, and collecting the results into a PCollectionList that is eventually written out to BigQuery. This is all done in one job, and the transforms are generated programmatically before the pipeline runs.
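As a rough illustration of that shape (a sketch only, again assuming the Apache Beam Java SDK; the topic list, the shared output table, and the TagWithTopicFn transform are made-up stand-ins for my actual transforms):

    import java.util.Arrays;
    import java.util.List;
    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    public class GeneratedTopicsPipeline {

      // Stand-in transform: records which topic a message came from so all
      // topics can share one output table.
      static class TagWithTopicFn extends DoFn<String, TableRow> {
        private final String topic;
        TagWithTopicFn(String topic) { this.topic = topic; }

        @ProcessElement
        public void processElement(ProcessContext c) {
          c.output(new TableRow().set("source_topic", topic).set("payload", c.element()));
        }
      }

      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        List<String> topics = Arrays.asList(
            "projects/my-project/topics/topic-a",
            "projects/my-project/topics/topic-b",
            "projects/my-project/topics/topic-c");

        // Generate one read + transform branch per topic and collect the results.
        PCollectionList<TableRow> branches = PCollectionList.empty(p);
        for (int i = 0; i < topics.size(); i++) {
          PCollection<TableRow> rows = p
              .apply("ReadTopic" + i, PubsubIO.readStrings().fromTopic(topics.get(i)))
              .apply("TagTopic" + i, ParDo.of(new TagWithTopicFn(topics.get(i))));
          branches = branches.and(rows);
        }

        // Merge all branches and write them out through a single BigQuery sink.
        branches.apply(Flatten.pCollections())
                .apply(BigQueryIO.writeTableRows()
                    .to("my-project:my_dataset.all_topics")
                    .withCreateDisposition(CreateDisposition.CREATE_NEVER)  // table assumed to exist
                    .withWriteDisposition(WriteDisposition.WRITE_APPEND));

        p.run();
      }
    }

The same loop could just as easily write each branch to its own table instead of flattening; generating the branches in code is what keeps everything in a single job either way.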