BigQuery range decorator duplicate issue

Question

We are facing issues with BigQuery range decorators on streaming table. The range decorator queries give duplicate data.

My case:

My BQ table is getting data regularly from customer events through streaming inserts. Another job is periodically fetching time bound data from the table using range decorator and sending it to dataflow jobs. like

First time fetching all the data from table using

SELECT * FROM [project_id:alpha.user_action@1450287482158]

when i ran this query got 91 records..

after 15 mins another query based on last interval

SELECT * FROM [alpha.user_action@1450287482159-1450291802380]

this also gave the same result with 91 records.

however i tried to run the same query again to cross check

SELECT * FROM [project_id:alpha.user_action@1450287482158]

Gives empty data.

any help on this?

Sean Chen Sean Chen · Accepted Answer · 2015-12-17T20:02:39

First off, have you tried using streaming dataflow? That might be a better fit (though your logic is not expressible as a query). Streaming dataflow also supports Tee-ing your writes, so you can keep both raw data and aggregate results.

On to your question:

Unfortunately this is a collision of two concepts that were built concurrently and somewhat independently, thus resulting in ill-defined interactions.

Time range table decorators were designed/built in a world where only load jobs existed. As such, blocks of data are atomically committed to a table at a single point in time. Time range decorators work quite well with this, as there are clear boundaries of inclusion/exclusion, and the relationship is stable.

Streaming Ingestion + realtime query is somewhat counter to the "load job" world. BigQuery is buffering your data for some period of time, making it available for analysis, and then periodically flushing the buffers onto the table using the traditional storage means. While the data is buffered, we have "infinite" time granularity. However, once we flush the buffer onto the table, that infinite granularity is compressed into a single time, which is currently the flush time.

Thus, using time range decorators on streaming tables can unfortunately result in some unexpected behaviors, as the same data may appear in two non-overlapping time windows (once while it is buffered, and once when it is flushed).

Our recommendation if you're trying to achieve windowed queries over recent data is to do the following:

Include a system timestamp in your data.
For the table decorator timestamps, include some buffer around the actual window to account for clock skew between your clock and Google's, and late arrivals from retry. This buffer should be both before and after your target window.
Modify your query to apply your actual time window.

It should be noted that depending on your actual usage purpose, this may not address your problems. If you can give more details, there might be a way to achieve what you need.

Sorry for the inconvenience.

BigQuery range decorator duplicate issue

1 Answers