0
votes

I have a table t in BigQuery that contains ~5billion rows (~80TB) and it is partitioned on column dateTimeCreated which is of type TIMESTAMP and is partitioned by DAY. The table contains data for 5 years so no more than 1825 partitions.

bigquery table details

I'd like to find out how many rows exist in the table per day so I crafted this SQL query:

select timestamp_trunc(datetimecreated,DAY),count(*)
from `p.d.t`
where datetimecreated > '2000-01-01'
group by 1
order by 1 desc

I was hoping that BigQuery would be able to return the results rapidly because this is basically counting the number of rows in each partition which, I would expect, is a tally that BigQuery maintains as internal metadata anyway (that's certainly my experience when using ingestion time partitioned tables).

Unfortunately that seems to not be the case. It took BigQuery 73s to return the result:

Query complete (1 min 13 sec elapsed, 37.4 GB processed)

I'm curious if there's a more efficient way to query this table. If it were an ingestion-time partitioned table my query would be:

select _PARTITION_DATE,count(*)
from `p.d.t`
where datetimecreated > '2000-01-01'
group by 1
order by 1 desc

which I'm confident would return very quickly. This isn't an ingestion-time partitioned table though.

Is there a more efficient method to achieve my desired result?

Another question,, does BigQuery provide queryable metadata per partition that includes the cardinality of the partition?

1

1 Answers

1
votes

Found the answer, this does the job:

SELECT table_name, partition_id, total_rows
FROM `p.d.INFORMATION_SCHEMA.PARTITIONS`
WHERE partition_id IS NOT NULL
 and table_name = 't'
order by partition_id desc

it returns quickly and, of course, queries much less data.

Query complete (1.7 sec elapsed, 10 MB processed)