7
votes

We have a table partitioned by day in BigQuery, which is updated by streaming inserts.

The doc says that: "when streaming to a partitioned table, data in the streaming buffer has a NULL value for the _PARTITIONTIME pseudo column"

But if I query for select count(*) from table where _PARTITIONTIME is NULL it always returns 0, even though bq show tells me that there are a lot of rows in the streaming buffer.

Does this mean that the pseudo column is not present at all for rows in streaming buffer? In any case, how can I query for the data ONLY in the streaming buffer without it becoming a full table scan?

Thanks in advance

2
what is the practical use case for this? i don't think you can query/read streaming buffer, but if you explain why you think you need to be able to read it - we might figure out how to make itMikhail Berlyant
I have a streaming job that keeps updating a table in BigQuery. I have a downstream job that triggers every 15 mins and aggregates the data for the day so far - so it needs to query something equivalen to where _PARTITIONTIME = today OR data_in_streaming_buffer. Is there any way to achieve this? Thanks.Venkatesh Iyer

2 Answers

9
votes

Data in the streaming buffer has a NULL value for the _PARTITIONTIME column.

SELECT
  fields
FROM
  `dataset.partitioned_table_name`
WHERE
  _PARTITIONTIME IS NULL

https://cloud.google.com/bigquery/docs/partitioned-tables#copying_to_partitioned_tables

3
votes

When you stream data to BQ you usually have the "warming-up" period and that's the time it takes for the streamed data to become available for operations such as querying, copying and exporting.

The doc states in the end that after a period of up to 90 mins the pseudo-column _PARTITIONTIME receives a non-null value, which means your streamed data is fully ready for any type of operation you want to run on the data (being able to run queries usually takes a few seconds).

That means that you don't query partitioned tables looking for when this field is null but instead, you do like so:

SELECT
  fields
FROM
  `dataset.partitioned_table_name`
WHERE
  _PARTITIONTIME = TIMESTAMP('2017-01-20') 

In this example, you would be querying only data streamed in the dates partition Jan/20 (which avoids a full table scan).

You can also select for a range of dates, you would just have to change the WHERE clause to:

WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-20') AND TIMESTAMP('2017-01-22') 

Which would query for 2 days in your table.