0
votes

We have a pipeline set to receive data from Apache Alooma to Bigquery (partitioned tables based on _PARTITIONTIME).

Observation so far is data is not directly dumped into _PARTITIONTIME column rather it first goes into 'null' partition and then gradually dumped into _PARTITIONTIME column .

example:

Query 1:

select DATE(_PARTITIONTIME) , count(1) from `<table>` where 
FORMAT_DATETIME( "%Y-%m-%d %T" , <col1>) >= "2019-05-07 00:02:00" and FORMAT_DATETIME( "%Y-%m-%d %T",<col1>)<= "2019-05-07 23:59:59" 
group by 1

Output :

1 null 21492
2 2019-05-07 6633

around 3 minutes reran the query and found the output as this

1 null 26355
2 2019-05-07 6633

Note -- Pulled record with a unique key from this partition and ran Query 2.

after 10-15 minutes again reran the query with the below results:

1 2019-05-07 28125
2 null 16660

To support the observation ran below query

Query 2:

select * from `<table>`  where unique_col >= '5cd0d13d0ae0' and
DATE(_PARTITIONTIME) = '2019-05-07'

--0 record

select * from `<table>`  where unique_col  >= '5cd0d13d0ae0' and
_PARTITIONTIME is null

--output appeared

record found in _PARTITIONTIME column after 4 min

select * from `<table>`  where unique_col  >= '5cd0d13d0ae0' and
DATE(_PARTITIONTIME) = '2019-05-07'

--output appeared

As per the above observation, seems the data first gets into the temporary null partition and from null to actual partition column. Is this understanding correct?

1

1 Answers

2
votes

seems the data first gets into temporary null partition and from null to actual partition column . Is this understanding correct ?

Yes it's. As can be seen from BigQuery documentation when streaming data into partition table the data is 1st stored in a streaming buffer with NULL value in the partition field. Once BigQuery finish inserting the record the _PARTITIONTIME field will hold the relevant date value

This is the relevant text from BigQuery documentation

Newly arriving data will be temporarily associated with the UNPARTITIONED partition while in the streaming buffer. A query can therefore exclude data in the streaming buffer from a query by filtering out the NULL values from the UNPARTITIONED partition by using one of the pseudocolumns