2 votes

I am streaming data into three BigQuery tables from Cloud Dataflow sinks, but am seeing very slow query results on one of the target tables, a "small" one with about 1.5 million rows. If I stop the Dataflow streaming job and come back to the table some time later, the same query runs quickly. Here's the query, in standard SQL dialect:

SELECT appname, start_time_to_minute, SUM(sum_client_bytes + sum_server_bytes) AS `sum`
FROM `myproject.myset`.`mysmalltable`
GROUP BY 1, 2

The relevant columns of mysmalltable are:

  • appname:STRING
  • start_time_to_minute:TIMESTAMP
  • sum_client_bytes:INTEGER
  • sum_server_bytes:INTEGER

Job ID: bquijob_568af346_15c82f9e5ec - it takes 12s.

This table is growing by about 2,000 rows per minute via streaming. Another target table in the same project grows much more quickly via streaming, maybe 200,000 rows per minute. If I run the query above on mysmalltable while streaming, it can take close to a minute, and we have seen query times of several minutes on similar queries.

Job ID: bquijob_7b4ea8a1_15c830b7f12 - it takes 42.8s.

If I add a filter, things get worse, e.g.:

WHERE REGEXP_CONTAINS(`appname`, 'Oracle')

Job ID: bquijob_502b0a06_15c830d9ecf - it takes 57s.

One yesterday took over 6 minutes:

Job ID: bquijob_49761f0d_15c80c6f415 - it took 6m38s.

I understand that, in order to support querying "live" data, BigQuery has a much less efficient data provider that operates on top of the streaming buffer. Is that implicated here? Is there a way we can make these queries run reliably in under 30s, for example by somehow avoiding the streaming buffer and querying only data that is more than a minute old? If the streaming buffer is implicated, it still doesn't quite add up for me, since I would have thought most of the data being read out of mysmalltable would still be in native format.
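One idea I've been wondering about, as a sketch only and assuming mysmalltable is (or could be made) an ingestion-time partitioned table, which may not match our current setup: rows still in the streaming buffer have a NULL _PARTITIONTIME, so filtering on that pseudo-column should restrict the query to data already committed to native storage.

SELECT appname, start_time_to_minute, SUM(sum_client_bytes + sum_server_bytes) AS `sum`
FROM `myproject.myset`.`mysmalltable`
-- rows still sitting in the streaming buffer have a NULL _PARTITIONTIME,
-- so this limits the scan to data already extracted into native storage
WHERE _PARTITIONTIME IS NOT NULL
GROUP BY 1, 2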

I appreciate any guidance!


1 Answer

0 votes

I've also seen this behaviour. The way I worked around it (I won't say "solved", because the underlying fix is mostly on Google's side) was to use micro-batching instead of streaming inserts. When concurrency is low, streaming inserts work really well, but at real big-data volumes (in my case, hundreds of thousands of rows) micro-batching is the better approach. I'm using the FILE_LOADS write method with a 3-minute window and it works really well. Hopefully that helps you.
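For reference, here is a rough sketch of that write configuration with the Apache Beam Java SDK. The table name is taken from the question; the PCollection name, shard count and dispositions are just illustrative, and it uses withTriggeringFrequency rather than explicit windowing to get the 3-minute micro-batches, which may differ from your exact pipeline.

import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

// "rows" is the unbounded PCollection<TableRow> your pipeline already produces.
rows.apply("WriteViaLoadJobs",
    BigQueryIO.writeTableRows()
        .to("myproject:myset.mysmalltable")
        // Micro-batch with periodic load jobs instead of streaming inserts,
        // so queried rows never sit in the streaming buffer.
        .withMethod(Method.FILE_LOADS)
        // Kick off a load job roughly every 3 minutes (required for unbounded input).
        .withTriggeringFrequency(Duration.standardMinutes(3))
        // Also required when a triggering frequency is set; tune for your volume.
        .withNumFileShards(10)
        .withCreateDisposition(CreateDisposition.CREATE_NEVER)
        .withWriteDisposition(WriteDisposition.WRITE_APPEND));

Load jobs don't go through the streaming buffer at all, so once each micro-batch lands the data is in native storage and queries only ever see "settled" rows.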