1 vote

We currently load most of our data into BigQuery either via CSV or directly via the streaming API. However, I was wondering whether there are any benchmarks available (or maybe a Google engineer could just tell me in the answer) showing how loading the different formats compares in efficiency.

For example, if we have the same 100M rows of data, does BigQuery show any performance difference from loading it in:

  • Parquet
  • CSV
  • JSON
  • Avro

I'm sure one of the answers will be "why don't you test it", but we're hoping that before architecting a converter or rewriting our application, an engineer could share with us which (if any) of the above formats would be the most performant for loading data from a flat file into BQ.

Note: all of the above files would be stored in Google Cloud Storage: https://cloud.google.com/bigquery/docs/loading-data-cloud-storage.
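For concreteness, this is roughly the kind of load we would be benchmarking, sketched with the google-cloud-bigquery Python client (the project, dataset, bucket, and file paths below are placeholders, not our real names):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and GCS path; swap source_format for PARQUET, CSV, or
# NEWLINE_DELIMITED_JSON to load the same data in a different format.
table_id = "my-project.my_dataset.events"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.AVRO,
    # CSV/JSON loads would additionally need an explicit schema or
    # autodetect=True; Avro and Parquet files carry their own schema.
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.avro", table_id, job_config=job_config
)
load_job.result()  # block until the load job finishes

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```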


2 Answers

4 votes

"Improve BigQuery ingestion times 10x by using Avro source format"

The ingestion speed has, to this point, been dependent upon the file format that we export from BigQuery. In prior releases of the SDK, tables and queries were made available to Dataflow as JSON-encoded objects in Google Cloud Storage. Considering that every such entry has the same schema, this representation is extremely redundant, essentially duplicating the schema, in string form, for every record.

In the 1.5.0 release, Dataflow uses the Avro file format to binary-encode and decode BigQuery data according to a single shared schema. This reduces the size of each individual record to correspond to the actual field values.
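To make the redundancy point concrete, here is a small, self-contained sketch (using the fastavro library, which the quoted text does not itself mention; the record fields are invented for illustration). It serializes the same rows as newline-delimited JSON, where every record repeats the field names, and as Avro, where the schema is written once and only the values are binary-encoded:

```python
import io
import json

import fastavro  # third-party: pip install fastavro

# One shared schema, written once in the Avro file header.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "Row",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "event", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
})

rows = [{"user_id": i, "event": "click", "amount": i * 0.5} for i in range(100_000)]

# Newline-delimited JSON: field names ("user_id", "event", ...) repeat in every record.
json_size = len("\n".join(json.dumps(r) for r in rows).encode("utf-8"))

# Avro: schema stored once, records are binary-encoded field values only.
avro_buffer = io.BytesIO()
fastavro.writer(avro_buffer, schema, rows)
avro_size = len(avro_buffer.getvalue())

print(f"NDJSON: {json_size:>12,} bytes")
print(f"Avro:   {avro_size:>12,} bytes")
```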

0 votes

Take care not to limit your comparison to benchmarks alone. These formats also impose some limitations on the client that writes data into BigQuery, and you should consider those as well. For instance: