What is the reason for delay to query data in Bigquery when partitioned at ingestion time?

Question

I was using bigquery to batch insert data from my python application. The tables were partitioned at ingestion time. The difference I saw was data that I inserted would appear for a query after 1.5 hrs after ingestion.

Later I changed the schema to have a timestamp column

This time I could query data immediately after ingestion.

Why is there a difference in behavior between _PARTITIONTIME pseudo-column vs timestamp column in the table schema?

PYTHON CODE FOR INGESTION:

This is a simplified version of the code:

bigquery_client = bigquery.Client()
TABLE_REF = bigquery_client.dataset('DATASET_ID').table('TABLE_ID')
TABLE = bigquery_client.get_table(TABLE_REF)

def ingest_to_bq(data: LIST[LIST]):
    bigquery_client.insert_rows(TABLE, data)

Table Schema:

[
    {
        "name": "epoch_ms",
        "type": "INTEGER",
        "mode": "REQUIRED"
    },
    {
        "name": "application_id",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "name": "ack_id",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "name": "data",
        "type": "STRING",
        "mode": "REQUIRED"
    }
]

Created the table from the BIGQUERY interface and partitioned at ingestion time.

Query:

I query again using BIGQUERY interface.

SELECT data from <DATASET_ID>.<TABLE_ID> WHERE _PARTITIONTIME="2020-03-30"

The above query would not display results that was ingested let say, half an hour back. It takes roughly 1.5 hours after ingestion to get the results.

NEW SCHEMA:

[
    {
        "name": "send_timestamp",
        "type": "TIMESTAMP",
        "mode": "REQUIRED"
    },
    {
        "name": "application_id",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "name": "ack_id",
        "type": "STRING",
        "mode": "REQUIRED"
    },
    {
        "name": "data",
        "type": "STRING",
        "mode": "REQUIRED"
    }
]

QUERY FOR NEW SCHEMA:

SELECT data from <DATASET_ID>.<TABLE_ID> WHERE send_timestamp>="2020-03-30 00:00:00" and send_timestamp<="2020-03-30 23:59:59"

This query gives back the result immediately after ingestion. I don't have to wait.

Can you please provide more information. Your schema, the format of the timestamp you ar e using, and also the python code you are using to interact with BigQuery — Paddy Popeye
May you explain what are you doing to create your table in both cases? I dont fully understand what is the difference between the two approaches. — rmesteves

rmesteves rmesteves · Accepted Answer · 2020-03-31T18:04:26

Apparently this behavior is normal and I could find the same situation after reproducing your environment.

The explanation for this delay is the BigQuery's streaming buffer. The streaming buffer is a a buffer that retains recently inserted rows and is optimized for writing throughput. In other words, when you insert streaming data into BigQuery your data is firstly inserted into the streaming buffer where it stays up to 90 minutes. At this point, data is considered durable and you can query it however you are not allowed to do some specific operations over it.

As you can see in the documentation, when your data is in the streaming buffer, your _PARTITIONTIME pseudo column will be NULL. Given that, your query is not able to find the new inserted rows because your _PARTITIONTIME is NULL. To to be sure that the values are NULL in your pseudo column for the recently inserted data, you can run the queries below.

If you want to see the pseudo column for all the rows, run SELECT *, _PARTITIONTIME p from <DATASET_ID>.<TABLE_ID>
If you want to get all the rows where the pseudo column is null, run SELECT * from <DATASET_ID>.<TABLE_ID> WHERE _PARTITIONTIME is null

Finally, I would like to add some usefull references for this topic.

BigQuery partitioned table's reference.
BigQuery streaming reference.
An official article about streaming into BigQuery which discuss the streaming buffer and how to handle it.

I hope it helps

What is the reason for delay to query data in Bigquery when partitioned at ingestion time?

1 Answers