1 vote

Background: We have a really simple pipeline that reads some data from BigQuery (usually ~300 MB), filters/transforms it, and writes it back to BigQuery. In 99% of cases this pipeline finishes in 7-10 minutes and is then restarted to process a new batch.
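For reference, a stripped-down sketch of what such a pipeline looks like (project, dataset, table names, and the filter predicate are placeholders rather than our real code, and the import paths/I/O transforms are written against a newer SDK than 0.6.0, so details may differ):

```python
# Stripped-down sketch of the pipeline described above (placeholder names).
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',                    # placeholder
    temp_location='gs://my-bucket/tmp',      # placeholder
)

p = beam.Pipeline(options=options)

(p
 | 'ReadFromBQ' >> beam.io.Read(beam.io.BigQuerySource(
       query='SELECT * FROM [my-project:dataset.source_table]'))    # placeholder query
 | 'Filter' >> beam.Filter(lambda row: row.get('status') == 'ok')   # placeholder predicate
 | 'WriteToBQ' >> beam.io.Write(beam.io.BigQuerySink(
       'my-project:dataset.target_table',                           # placeholder
       write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)))

result = p.run()
```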

Problem: Recently, the job has started to take >3 hours once in a while, maybe 2 times a month out of ~2000 runs. When I look at the logs, I can't see any errors, and in fact it's only the first step (the read from BigQuery) that takes so long.

Does anyone have a suggestion on how to approach debugging such cases, especially since it's really the read from BigQuery and not any of our transformation code? We are using Apache Beam SDK for Python 0.6.0 (maybe that's the reason!?).

Is it maybe possible to define a timeout for the job?
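For context, the kind of client-side timeout I had in mind would look roughly like this (a sketch only; it assumes a newer Beam Python SDK where the Dataflow pipeline result supports `wait_until_finish(duration)` and `cancel()`, which may not hold for 0.6.0):

```python
from apache_beam.runners.runner import PipelineState

# 'p' is the pipeline constructed above; run it and wait with a bound.
result = p.run()

# wait_until_finish takes a duration in milliseconds and returns the job
# state it observed when it stopped waiting (assumed newer-SDK behaviour).
state = result.wait_until_finish(duration=30 * 60 * 1000)  # wait up to 30 min

if state != PipelineState.DONE:
    # The job is still running (or failed) after the timeout, so cancel it
    # instead of letting a stuck BigQuery read block the next batch for hours.
    result.cancel()
```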


Please include the Dataflow job ID so that someone on the Dataflow team can take a look at it and help debug the performance. - jkff
Thanks @jkff, the slow job ID in question is "2018-01-24_21_26_22-2131680617017922084". And here is the ID of a run of the same pipeline that had the expected execution time of ~10 min: "2018-01-24_23_31_21-15706979146276820485". - Dimitri Masin
Here is another example of a slow job, "2018-01-16_11_06_28-7923202670027546242" (which I had to cancel in the end). - Dimitri Masin

1 Answer

3 votes

This is an issue on either the Dataflow side or the BigQuery side, depending on how one looks at it. When splitting the data for parallel processing, Dataflow relies on an estimate of the data size. The long runtime happens when BigQuery sporadically gives a severe under-estimate of the query result size; as a consequence, Dataflow severely over-splits the data, and the runtime becomes bottlenecked by the overhead of reading lots and lots of tiny file chunks exported by BigQuery.
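To make the effect concrete, here is a rough back-of-the-envelope illustration of the mechanism; the numbers and the exact split heuristic are purely illustrative, not taken from the affected jobs:

```python
# Back-of-the-envelope illustration of over-splitting (all numbers made up).
actual_size_mb = 300.0      # real size of the query result
estimated_size_mb = 0.3     # a severe under-estimate of the result size
target_parallelism = 100    # hypothetical number of bundles aimed for

# The desired bundle size is derived from the *estimate* ...
desired_bundle_mb = estimated_size_mb / target_parallelism   # 0.003 MB

# ... but the *actual* data is 1000x larger, so it ends up exported as a huge
# number of tiny file chunks, and per-chunk read overhead dominates runtime.
approx_chunks = actual_size_mb / desired_bundle_mb            # ~100,000
print(int(approx_chunks))
```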

On one hand, this is the first time I've seen BigQuery produce such dramatically incorrect query result size estimates. However, since size estimates are inherently best-effort and can in general be arbitrarily off, Dataflow should control for that and prevent such over-splitting. We'll investigate and fix this.

In the meantime, the only workaround that comes to mind is to use the Java SDK: it uses quite different code for reading from BigQuery that, as far as I recall, does not rely on query size estimates.