
I am building an infrastructure where I'd like to sink hot and cold data separately. For the hot data, I'm writing my data to Cloud Spanner, and for the cold data, I'd like to write my data to something more persistent like BigQuery.

I'm consuming data from a streaming service, but I'd like to take advantage of BigQuery's caching mechanism, which won't be possible if I'm constantly streaming the cold data into BigQuery. My question is whether I can fork a streaming pipeline into a batch pipeline, with the streaming pipeline writing to Spanner and the batch pipeline writing to BigQuery.

I can envision something along the lines of writing the cold data to Cloud Storage and loading it into BigQuery with a cron job (sketched below), but is there a better/native way to achieve the stream + batch split?
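Roughly, this is what I have in mind (an Apache Beam Java sketch; the topic, table, instance, and bucket names are placeholders, and the windowing and cron-driven load are assumptions on my part):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.spanner.SpannerIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import com.google.cloud.spanner.Mutation;
import org.joda.time.Duration;

public class HotColdFork {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // One unbounded source feeds both branches.
    PCollection<String> events = p.apply("ReadStream",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/events"));

    // Hot path: write each event to Cloud Spanner as it arrives.
    events
        .apply("ToMutation", MapElements.into(TypeDescriptor.of(Mutation.class))
            .via((String e) ->
                Mutation.newInsertOrUpdateBuilder("events").set("payload").to(e).build()))
        .apply("WriteHot", SpannerIO.write()
            .withInstanceId("my-instance")
            .withDatabaseId("my-database"));

    // Cold path: window the stream and stage files on Cloud Storage,
    // to be loaded into BigQuery later (e.g. by a cron-driven load job).
    events
        .apply("Window", Window.into(FixedWindows.of(Duration.standardMinutes(15))))
        .apply("WriteCold", TextIO.write()
            .to("gs://my-bucket/cold/")
            .withWindowedWrites()
            .withNumShards(1));

    p.run();
  }
}
```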


1 Answer


While it is true that Dataflow has batch and streaming execution modes, you can use the streaming mode to do anything you can do in batch mode (costs and scalability may differ). Since your input is a stream, aka an unbounded data source, your pipeline will run in streaming mode automatically.

It sounds like the FILE_LOADS method of writing to BigQuery may be what you want: instead of using streaming inserts, it stages the data and writes it to BigQuery via periodic batch load jobs, and you can use withTriggeringFrequency to manage how often those load jobs run.
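For example, something along these lines with the Beam Java SDK (the dataset/table, schema, and 10-minute frequency are just placeholders; if I recall correctly, withNumFileShards must also be set whenever you set withTriggeringFrequency on an unbounded input):

```java
import java.util.Arrays;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

class ColdSink {
  // coldRows is the cold branch of your streaming pipeline, already mapped to TableRow.
  static void writeColdToBigQuery(PCollection<TableRow> coldRows) {
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("payload").setType("STRING")));

    coldRows.apply("WriteColdToBigQuery",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.cold_events")
            .withSchema(schema)
            // Stage files and run BigQuery load jobs instead of streaming inserts.
            .withMethod(Method.FILE_LOADS)
            // Start a load job at most once per interval.
            .withTriggeringFrequency(Duration.standardMinutes(10))
            // Required when a triggering frequency is set on an unbounded input.
            .withNumFileShards(1)
            .withCreateDisposition(CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(WriteDisposition.WRITE_APPEND));
  }
}
```

Because the table is then only updated when a load job completes, queries against it can still benefit from BigQuery's cached results between loads.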