0 votes

I want to understand how a Dataflow pipeline works.

In my case, something is published to Cloud Pub/Sub periodically, which Dataflow then writes to BigQuery. The volume of messages coming through is in the thousands, so my publisher client uses batch settings of 1,000 messages, 1 MB, and 10 seconds of latency.
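
For context, this is roughly what that publisher-side batching looks like with the Python Pub/Sub client; the project and topic names below are placeholders, and the thresholds mirror the settings described above.

```python
from google.cloud import pubsub_v1

# Publisher-side batching: the client buffers messages until one of these
# thresholds is hit, then sends the buffered messages in a single publish
# request. This is a transport optimization on the publish side only; each
# message is still delivered to subscribers as an individual message.
batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=1000,          # flush after 1,000 buffered messages...
    max_bytes=1 * 1024 * 1024,  # ...or 1 MB of buffered data...
    max_latency=10,             # ...or 10 seconds, whichever comes first
)

publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholder names

for payload in (b'{"id": 1}', b'{"id": 2}'):
    publisher.publish(topic_path, data=payload)
```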

My question: when messages are published in a batch, does Dataflow SQL take in all the messages in the batch and write them to BigQuery in one go, or does it write one message at a time?

And is there any benefit to one over the other?

Please comment if any other details are required. Thanks.


1 Answer

1 vote

Dataflow SQL is just a way to define an Apache Beam pipeline with SQL syntax and run it on Dataflow.

Because the source is Pub/Sub, a streaming pipeline is created from your SQL definition. When you run your SQL command, a Dataflow job starts and waits for messages from Pub/Sub.
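
As a rough illustration of what such a pipeline amounts to, here is a hand-written Apache Beam equivalent in Python. This is not the exact pipeline Dataflow SQL generates, and the topic, table, and schema names are placeholders.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions()
# An unbounded Pub/Sub source makes this a streaming job.
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/my-topic")  # placeholder
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # For an unbounded source, WriteToBigQuery defaults to streaming
        # inserts, so rows are written continuously as messages arrive rather
        # than in one load per publish batch.
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",    # placeholder table
            schema="id:INTEGER,payload:STRING",  # placeholder schema
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```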

If you publish a bunch of messages, Dataflow can scale up to process them as quickly as possible.
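
If scaling behaviour matters for your costs, the worker count can be capped when the job is launched. The snippet below is only a sketch using standard Dataflow pipeline options; project and region values are placeholders.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Dataflow autoscaling is throughput-based by default; max_num_workers caps
# how far a streaming job can scale out.
options = PipelineOptions([
    "--runner=DataflowRunner",
    "--project=my-project",    # placeholder
    "--region=us-central1",    # placeholder
    "--autoscaling_algorithm=THROUGHPUT_BASED",
    "--max_num_workers=5",
    "--streaming",
])
```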

Keep in mind that Dataflow streaming never scales to 0, so you will always pay for at least one VM to keep your pipeline up and running.