I am streaming data into three BigQuery tables from Cloud DataFlow sinks, but am seeing very slow query results on one of the target tables - a "small" one, about 1.5 million rows. If I stop the DataFlow streaming job and come back to the table some time later, the same query runs quickly. Here's the query in Standard SQL Dialect:
SELECT appname, start_time_to_minute, SUM(sum_client_bytes + sum_server_bytes) as `sum`
FROM `myproject.myset`.`mysmalltable`
GROUP BY 1, 2
The relevant columns of `mysmalltable` are:
- appname:STRING
- start_time_to_minute:TIMESTAMP
- sum_client_bytes:INTEGER
- sum_server_bytes:INTEGER
Job ID: bquijob_568af346_15c82f9e5ec - it takes 12s.
This table is growing by about 2000 rows per minute via streaming. Another target table in the same project grows much more quickly, maybe 200,000 rows per minute. If I run the query above on `mysmalltable` while streaming, it can take close to a minute, and we have seen similar queries take several minutes.
Job ID: bquijob_7b4ea8a1_15c830b7f12 - it takes 42.8s.
If I add a filter, things get worse, e.g.
WHERE REGEXP_CONTAINS(`appname`, 'Oracle')
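i.e. the full query becomes:
SELECT appname, start_time_to_minute, SUM(sum_client_bytes + sum_server_bytes) as `sum`
FROM `myproject.myset`.`mysmalltable`
WHERE REGEXP_CONTAINS(`appname`, 'Oracle')
GROUP BY 1, 2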
Job ID: bquijob_502b0a06_15c830d9ecf - it takes 57s.
One yesterday took over 6 minutes:
Job ID: bquijob_49761f0d_15c80c6f415 - it took 6m38s.
I understand that, in order to support querying "live" data, BigQuery has a much less efficient data provider that operates on top of the streaming buffer. Is that what is at play here? Is there a way we can make these queries run reliably in under 30s, for example by somehow avoiding the streaming buffer and only using data that is more than a minute old? If the streaming buffer is implicated, it still doesn't quite add up for me, since I would have thought most of the data being read out of `mysmalltable` would still be in native format.
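One workaround I've been wondering about, assuming newly streamed rows only sit in the streaming buffer for a short while, is to explicitly exclude the most recent data, along the lines of:
SELECT appname, start_time_to_minute, SUM(sum_client_bytes + sum_server_bytes) as `sum`
FROM `myproject.myset`.`mysmalltable`
-- 5 minutes is an arbitrary cutoff, chosen in the hope of skipping rows still in the streaming buffer
WHERE start_time_to_minute < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 5 MINUTE)
GROUP BY 1, 2
but I don't know whether a predicate on an ordinary column like this actually stops BigQuery from reading the streaming buffer, or whether it only filters rows out after the buffer has already been scanned.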
I appreciate any guidance!