
BigQuery supports querying external tables stored in various GCS storage classes, including Coldline. Accessing data in Coldline incurs a data retrieval fee.

Parquet files use columnar storage. When BigQuery reads Parquet files from Coldline GCS, is the data retrieval cost based on the columns queried or on the entire Parquet file?


1 Answer


To address the easy part of your question first: BigQuery charges based on the logical (uncompressed) size of just the columns read, across all files that need to be read. If you read an INT64 field "foo" from a file with 1M rows, you'll be charged for 8 MB (8 bytes per integer × 1M rows).
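To make that arithmetic concrete, here is a minimal sketch using the google-cloud-bigquery Python client. The table name `mydataset.mytable` and the column `foo` are hypothetical; a dry run reports the bytes a query would be billed for without actually running it (for external tables the estimate may be approximate or unavailable).

```python
# Minimal sketch: estimate the logical bytes billed for a single-column scan.
# Assumes the google-cloud-bigquery client library and a hypothetical
# table `mydataset.mytable` with an INT64 column `foo` and ~1M rows.
from google.cloud import bigquery

client = bigquery.Client()

# A dry run reports bytes processed without executing (or billing) the query.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query("SELECT foo FROM `mydataset.mytable`", job_config=job_config)

# Roughly 8,000,000 for 1M rows: 8 bytes per INT64, and only the `foo`
# column counts; other columns in the file are not billed.
print(f"Bytes that would be processed: {job.total_bytes_processed}")
```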

If a file can be skipped entirely, either due to Hive partition pruning or because the Parquet metadata contains statistics showing the file is not needed for the query, then there are no charges for scanning that file.
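The per-block statistics that make this skipping possible are visible in the Parquet metadata itself. Here is a sketch, assuming a hypothetical local file `data.parquet` and the pyarrow library:

```python
# Sketch: inspect the per-block min/max statistics a reader can use to
# decide whether a block (row group) needs to be read at all.
import pyarrow.parquet as pq

meta = pq.ParquetFile("data.parquet").metadata

for rg in range(meta.num_row_groups):
    col = meta.row_group(rg).column(0)  # first column's chunk in this block
    stats = col.statistics
    if stats is not None and stats.has_min_max:
        # A filter like `WHERE foo > 100` can skip any block whose max <= 100
        # without reading the column data itself.
        print(f"block {rg}: min={stats.min}, max={stats.max}")
```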

The other part of your question concerns billing of reads from Coldline. If you read from Coldline via BigQuery, you will not be billed for Coldline reads. That said, please do not count on this remaining the case long term; there is discussion going on internally within Google about how to close this hole.

In the future, when Coldline reads are charged, it will most likely work as follows: you will be billed for the total physical bytes that must be read to run the query.

Parquet files have file-level metadata (stored in the footer), row groups ("blocks") with their own metadata, and column chunks within each block. To read a Parquet file you need to read the file metadata, the block metadata, and the relevant column chunks. Depending on the filter, some blocks may be skippable, in which case you won't get charged for them. On the other hand, some queries may require reading the same file multiple times (e.g. a self-join). The physical read size would then be the sum of all bytes read each time the file was read.
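Under that model, a rough tally of the physical bytes for one pass over a file would look something like the sketch below (again assuming a hypothetical `data.parquet` and pyarrow; a real read also includes page headers, which this ignores):

```python
# Sketch: tally the physical (compressed) bytes one pass over a Parquet
# file would read for a given set of columns, per the model above.
import pyarrow.parquet as pq

def physical_bytes_for_columns(path, wanted):
    meta = pq.ParquetFile(path).metadata
    total = meta.serialized_size  # footer metadata is read on every pass
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for c in range(row_group.num_columns):
            col = row_group.column(c)
            if col.path_in_schema in wanted:  # skip columns not queried
                total += col.total_compressed_size
    return total

# A query that reads the file twice (e.g. a self-join) would pay roughly
# twice this amount.
print(physical_bytes_for_columns("data.parquet", {"foo"}))
```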