4 votes

We have a streaming pipeline running on Google Cloud Dataflow that needs to read from a PubSub subscription, group messages, and write them to BigQuery. The built-in BigQuery sink does not fit our needs, as we need to target a specific dataset and table for each group. Since custom sinks are not supported for streaming pipelines, the only solution seems to be to perform the insert operations in a ParDo. Something like this:

(screenshot of the pipeline code)
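A minimal sketch of the idea, assuming the google-cloud-bigquery streaming-insert API (`insert_rows_json`); the `route_table` helper, the message fields, and the injected client factory are hypothetical names, not from the original pipeline:

```python
# Sketch only: a ParDo-style class that performs BigQuery streaming inserts
# itself. In a real pipeline this would subclass beam.DoFn; the client is
# injected via a factory so it can be stubbed out in tests.

def route_table(message):
    """Derive the target dataset and table from a message's group key.

    Hypothetical convention: a group key like "customer42/events" maps to
    dataset "customer42", table "events".
    """
    dataset, table = message["group"].split("/", 1)
    return dataset, table


class WriteToGroupTable:  # would subclass beam.DoFn in a real pipeline
    def __init__(self, project, client_factory):
        self.project = project
        self.client_factory = client_factory

    def start_bundle(self):
        # One client per bundle; avoids re-creating it for every element.
        self.client = self.client_factory()

    def process(self, message):
        dataset, table = route_table(message)
        table_ref = f"{self.project}.{dataset}.{table}"
        # insert_rows_json returns a list of per-row errors (empty on success).
        errors = self.client.insert_rows_json(table_ref, [message["payload"]])
        if errors:
            raise RuntimeError(f"BigQuery insert failed: {errors}")
```

Because the destination is computed per element, the one-table-per-sink limitation of the built-in sink does not apply.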

Is there any known issue with not having a sink in a pipeline, or anything to be aware of when writing this kind of pipeline?

We have a pipeline that needs to write to different datasets/tables depending on the input. We use side outputs to write to N BigQuery sinks. Could this work for you too? – Graham Polley
I thought about that, but how large is your N? For us, N would be around 1 million. – Thomas
Ehh... we have about 10-20 sinks. I think 1 million might be a problem! This sounds like an odd request – the fact that you need to write to a million different tables in BigQuery. Can you elaborate a little more on the problem you are trying to solve and give more context? – Graham Polley

1 Answer

4 votes

There should not be any issues with writing a pipeline without a sink. In fact, in streaming a sink is itself a type of ParDo.

I recommend that you use a custom ParDo and call the BigQuery API with your own logic. Here is the definition of the BigQuerySink; you can use this code as a starting point.

You can define your own DoFn similar to StreamingWriteFn to add your custom ParDo logic, which will write to the appropriate BigQuery dataset/table.
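A sketch of such a DoFn, in the spirit of StreamingWriteFn's bundle-scoped batching; the class name, the `MAX_BATCH` threshold, the `(table, row)` element shape, and the injected client factory are all hypothetical assumptions, not the actual StreamingWriteFn code:

```python
# Sketch only: batches rows per destination table within a bundle and
# flushes them via streaming inserts. Would subclass beam.DoFn in a real
# pipeline; the client factory is injected so tests can stub it.
from collections import defaultdict


class BatchingBigQueryWriteFn:  # would subclass beam.DoFn
    MAX_BATCH = 500  # hypothetical flush threshold per table

    def __init__(self, client_factory):
        self.client_factory = client_factory

    def start_bundle(self):
        self.client = self.client_factory()
        self.batches = defaultdict(list)  # table spec -> pending rows

    def process(self, element):
        # Element is assumed to be a (table_spec, row) pair after grouping.
        table, row = element
        self.batches[table].append(row)
        if len(self.batches[table]) >= self.MAX_BATCH:
            self._flush(table)

    def finish_bundle(self):
        # Flush whatever is still buffered when the bundle ends.
        for table in list(self.batches):
            self._flush(table)

    def _flush(self, table):
        rows = self.batches.pop(table, [])
        if rows:
            errors = self.client.insert_rows_json(table, rows)
            if errors:
                raise RuntimeError(f"insert failed for {table}: {errors}")
```

Batching per table keeps the number of streaming-insert requests bounded even when many distinct tables appear in one bundle.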

Note that this code uses Reshuffle instead of GroupByKey. I recommend Reshuffle, which also groups by key but avoids unnecessary windowing delays; in this case it means that elements are written out as soon as they arrive, without extra buffering. It also lets you determine the BQ table names at runtime.

Edit: I do not recommend using the built-in BigQuerySink to write to different tables. The suggestion is to use the BigQuery API from your own custom DoFn rather than the BigQuerySink.