3 votes

I found a couple related questions, but no definitive answer from the Google team, for this particular question:

Is a Cloud Dataflow job that writes to BigQuery limited to the BigQuery quota of 100K rows per second per table (i.e., the BQ streaming limit)?

google dataflow write to bigquery table performance

Cloud DataFlow performance - are our times to be expected?


Edit: The main motivation is to find a way to predict runtimes for various input sizes.

I've managed to run jobs that show > 180K rows/sec processed in the Dataflow monitoring UI. But I'm unsure whether the inserts into the table are somehow throttled, since the job's runtime was about 2x slower than a naive calculation would suggest (500M rows / 180K rows/sec ≈ 46 mins, but the job actually took almost 2 hrs).


1 Answer

6 votes

From your message, it sounds like you are executing your pipeline in batch, not streaming, mode.

In batch mode, jobs run on the Google Cloud Dataflow service do not use BigQuery's streaming writes. Instead, we write all the rows to be imported to files on GCS, and then invoke a BigQuery load job. Note that this reduces your costs (load jobs are cheaper than streaming writes) and is more efficient overall (BigQuery can be faster doing a bulk load than doing per-row imports). The tradeoff is that no results are available in BigQuery until the entire job finishes successfully.
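For reference, here is a minimal sketch of what that batch write looks like, assuming the current Apache Beam SDK's BigQueryIO API (the table reference, schema, and rows below are hypothetical placeholders). For bounded inputs, file loads is the default behavior; the method is only spelled out here to make the point explicit:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class BatchLoadExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hypothetical schema for the example rows below.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("name").setType("STRING"),
        new TableFieldSchema().setName("count").setType("INTEGER")));

    p.apply("CreateRows", Create.of(
            new TableRow().set("name", "a").set("count", 1),
            new TableRow().set("name", "b").set("count", 2))
        .withCoder(TableRowJsonCoder.of()))
     .apply("WriteToBQ",
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // hypothetical table reference
            // Bounded (batch) inputs are written as files to GCS and then
            // imported with a BigQuery load job, not streamed row by row.
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}
```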

Load jobs are not limited to a certain number of rows per second; rather, they are limited by the daily quotas.

In Streaming mode, Dataflow does indeed use BigQuery's streaming writes. In that case, the 100,000 rows per second limit does apply. If you exceed that limit, Dataflow will get a quota_exceeded error and will then retry the failing inserts. This behavior will help smooth out short-term spikes that temporarily exceed BigQuery's quota; if your pipeline exceeds quota for a long period of time, this fail-and-retry policy will eventually act as a form of backpressure that slows your pipeline down.
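A sketch of the corresponding streaming configuration, again assuming the Apache Beam BigQueryIO API; the table reference is hypothetical and assumed to already exist:

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;

public class StreamingWriteExample {
  // Returns a BigQuery sink configured for streaming inserts (the default for
  // unbounded pipelines). Rows go through the streaming API, which is subject
  // to the per-table streaming quota.
  static BigQueryIO.Write<TableRow> streamingWrite() {
    return BigQueryIO.writeTableRows()
        .to("my-project:my_dataset.my_table")  // hypothetical, pre-existing table
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
        .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
        // Retry inserts that fail with transient errors such as quota_exceeded;
        // sustained quota pressure therefore turns into backpressure.
        .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors());
  }
}
```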

--

As for why your job took 2 hours instead of 45 minutes: your job has multiple stages that proceed serially, so using the throughput of the fastest stage is not an accurate way to estimate end-to-end runtime. For example, the BigQuery load job is not initiated until after Dataflow finishes writing all rows to GCS. Your rates seem reasonable, but please follow up if you suspect a performance degradation.
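As a rough illustration of why the fastest stage's rate overestimates end-to-end speed, here is a back-of-envelope calculation using the numbers from the question; the split between phases is purely hypothetical, since real stage times vary by job:

```java
public class RuntimeEstimate {
  public static void main(String[] args) {
    double rows = 500_000_000d;   // 500M rows, from the question
    double peakRate = 180_000d;   // rows/sec observed at the fastest stage

    // Naive estimate: assume the whole job runs at the peak stage rate.
    double naiveMinutes = rows / peakRate / 60;  // ~46 minutes

    // Serial stages: processing + writing files to GCS completes first,
    // and only then does the BigQuery load job start. The hypothetical
    // load-job duration below is just a placeholder for illustration.
    double processAndWriteMinutes = naiveMinutes;
    double hypotheticalLoadJobMinutes = 60;
    double serialMinutes = processAndWriteMinutes + hypotheticalLoadJobMinutes;

    System.out.printf("naive: ~%.0f min, serial estimate: ~%.0f min%n",
        naiveMinutes, serialMinutes);
  }
}
```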