1 vote

We currently load most of our data into BigQuery either via CSV or directly via the streaming API. However, I was wondering whether there are any benchmarks available (or maybe a Google engineer could just tell me in the answer) showing how loading the different formats compares in efficiency.

For example, if we have the same 100M rows of data, does BigQuery show any performance difference from loading it in:

  • Parquet
  • CSV
  • JSON
  • Avro

I'm sure one of the answers will be "why don't you test it", but we're hoping that before architecting a converter or rewriting our application, an engineer could share with us which (if any) of the above formats would be the most performant for loading data from a flat file into BQ.

Note: all of the above files would be stored in Google Cloud Storage: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage.
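For concreteness, this is roughly the kind of load we would be benchmarking, sketched with the google-cloud-bigquery Python client (the project, dataset, bucket, and file paths below are placeholders, not our real names):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and GCS path; swap source_format for PARQUET, CSV, or
# NEWLINE_DELIMITED_JSON to load the same data in a different format.
table_id = "my-project.my_dataset.events"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # CSV/JSON loads would additionally need an explicit schema or
    # autodetect=True; Avro and Parquet files carry their own schema.
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.avro", table_id, job_config=job_config
)
load_job.result()  # block until the load job finishes

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```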


2 Answers

4 votes

"Improve BigQuery ingestion times 10x by using Avro source format"

The ingestion speed has, to this point, been dependent upon the file format that we export from BigQuery. In prior releases of the SDK, tables and queries were made available to Dataflow as JSON-encoded objects in Google Cloud Storage. Considering that every such entry has the same schema, this representation is extremely redundant, essentially duplicating the schema, in string form, for every record.

In the 1.5.0 release, Dataflow uses the Avro file format to binary-encode and decode BigQuery data according to a single shared schema. This reduces the size of each individual record to correspond to the actual field values.
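To make the redundancy point concrete, here is a small, self-contained sketch (using the fastavro library, which the quoted text does not itself mention; the record fields are invented for illustration). It serializes the same rows as newline-delimited JSON, where every record repeats the field names, and as Avro, where the schema is written once and only the values are binary-encoded:

```python
import io
import json

import fastavro  # third-party: pip install fastavro

# One shared schema, written once in the Avro file header.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "event", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

rows = [{"user_id": i, "event": "click", "amount": i * 0.5} for i in range(100_000)]

# Newline-delimited JSON: field names ("user_id", "event", ...) repeat in every record.
json_size = len("\n".join(json.dumps(r) for r in rows).encode("utf-8"))

# Avro: schema stored once, records are binary-encoded field values only.
avro_buffer = io.BytesIO()
fastavro.writer(avro_buffer, schema, rows)
avro_size = len(avro_buffer.getvalue())

print(f"NDJSON: {json_size:>12,} bytes")
print(f"Avro:   {avro_size:>12,} bytes")
```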

0 votes

Take care not to limit your comparison to benchmarks alone. These formats also impose some limitations on the client that writes data into BigQuery, and you should consider those as well. For instance: