1 vote

I would like to build the following pipeline:

pub/sub --> dataflow --> bigquery

The data is streaming, but I would like to avoid streaming it directly into BigQuery. Instead, I was hoping to batch up small chunks on the Dataflow workers and write them to BQ as load jobs once they reach a certain size or age.
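
For reference, the plain streaming-insert version of the pipeline (the approach I want to avoid) would look roughly like the sketch below; the project, topic, table and schema names are just placeholders:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder project/topic/table/schema, only to show the pipeline shape.
    options = PipelineOptions(streaming=True, project='my-project')

    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
             topic='projects/my-project/topics/my-topic')
         | 'Parse' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
         # With no method specified on an unbounded source, this uses
         # streaming inserts, which is exactly what I want to avoid.
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
             'my-project:my_dataset.my_table',
             schema='name:STRING,value:INTEGER'))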

I cannot find any examples of how to do this using the Python Apache Beam SDK, only Java.

Why do you want to avoid streaming it into BigQuery?! - Graham Polley
Hey @GrahamPolley because there is a cost associated with streaming inserts, whilst load jobs are free :) - dendog
True, but it's usually negligible unless you're running at massive scale. Creating some sort of micro-batch off PubSub will require more dev time and it will have more moving components i.e. more failure points and areas to debug. Is it really worth it? If you micro-batch, you'll need to write out to GCS beforehand, and then pay for the storage too. - Graham Polley

1 Answer

4 votes

This is a work in progress. The FILE_LOADS method is currently only available for batch pipelines (behind the use_beam_bq_sink experiments flag; it will become the default in the future).
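
As a rough sketch of the batch case (project, bucket, table and schema names are placeholders), the important parts are the experiments flag and method=FILE_LOADS:

    import json

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # Placeholder names; temp_location is needed so the sink can stage the
    # files it loads into BigQuery.
    options = PipelineOptions(
        flags=['--experiments=use_beam_bq_sink'],
        project='my-project',
        temp_location='gs://my-bucket/tmp')

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.json')
         | 'Parse' >> beam.Map(json.loads)
         | 'Write' >> beam.io.WriteToBigQuery(
             'my-project:my_dataset.my_table',
             schema='name:STRING,value:INTEGER',
             method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
             write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))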

However, for streaming pipelines, as seen in the source code, it raises a NotImplementedError with the message:

File Loads to BigQuery are only supported on Batch pipelines.
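
For example, a streaming pipeline along these lines (placeholder names again) currently hits that error as soon as it asks for FILE_LOADS:

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        flags=['--experiments=use_beam_bq_sink'],
        streaming=True,
        project='my-project',
        temp_location='gs://my-bucket/tmp')

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromPubSub(
             topic='projects/my-project/topics/my-topic')
         | 'ToRow' >> beam.Map(lambda msg: {'data': msg.decode('utf-8')})
         # Requesting FILE_LOADS on an unbounded source is what triggers the
         # NotImplementedError quoted above.
         | 'Write' >> beam.io.WriteToBigQuery(
             'my-project:my_dataset.my_table',
             schema='data:STRING',
             method=beam.io.WriteToBigQuery.Method.FILE_LOADS))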

There is an open JIRA ticket where you can follow the progress.