I was using bigquery to batch insert data from my python application. The tables were partitioned at ingestion time. The difference I saw was data that I inserted would appear for a query after 1.5 hrs after ingestion.
Later I changed the schema to have a timestamp column
This time I could query data immediately after ingestion.
Why is there a difference in behavior between _PARTITIONTIME pseudo-column vs timestamp column in the table schema?
PYTHON CODE FOR INGESTION:
This is a simplified version of the code:
bigquery_client = bigquery.Client()
TABLE_REF = bigquery_client.dataset('DATASET_ID').table('TABLE_ID')
TABLE = bigquery_client.get_table(TABLE_REF)
def ingest_to_bq(data: LIST[LIST]):
bigquery_client.insert_rows(TABLE, data)
Table Schema:
[
{
"name": "epoch_ms",
"type": "INTEGER",
"mode": "REQUIRED"
},
{
"name": "application_id",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "ack_id",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "data",
"type": "STRING",
"mode": "REQUIRED"
}
]
Created the table from the BIGQUERY interface and partitioned at ingestion time.
Query:
I query again using BIGQUERY interface.
SELECT data from <DATASET_ID>.<TABLE_ID> WHERE _PARTITIONTIME="2020-03-30"
The above query would not display results that was ingested let say, half an hour back. It takes roughly 1.5 hours after ingestion to get the results.
NEW SCHEMA:
[
{
"name": "send_timestamp",
"type": "TIMESTAMP",
"mode": "REQUIRED"
},
{
"name": "application_id",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "ack_id",
"type": "STRING",
"mode": "REQUIRED"
},
{
"name": "data",
"type": "STRING",
"mode": "REQUIRED"
}
]
QUERY FOR NEW SCHEMA:
SELECT data from <DATASET_ID>.<TABLE_ID> WHERE send_timestamp>="2020-03-30 00:00:00" and send_timestamp<="2020-03-30 23:59:59"
This query gives back the result immediately after ingestion. I don't have to wait.