
I've got a question about BQ performance in various scenarios, especially revolving around parallelization "under the hood".

I am saving 100M records on a daily basis. At the moment, I am rotating tables every 5 days to avoid high charges due to full table scans.

If I were to run a query with a date range of "last 30 days" (for example), I would be scanning between 6 (if I am at the last day of the partition) and 7 tables.

I could, as an alternative, partition my data into a new table daily. In this case, I would optimize my expenses, as I would never query more data than I have to. The question is: will I suffer a performance penalty in getting the results back to the client, because I am now querying potentially 30 or 90 or 365 tables in parallel (via UNION)?
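For concreteness, here is roughly what the daily-table variant would look like; this is just a sketch using the google-cloud-bigquery Python client, and the table names and columns are made up:

```python
from google.cloud import bigquery

client = bigquery.Client()

# In legacy BigQuery SQL, comma-separated tables in FROM mean
# UNION ALL, so only the listed daily tables are scanned (and billed).
# One [dataset.table] entry per day in the date range.
query = """
SELECT user_id, COUNT(*) AS events
FROM [mydataset.events20140101],
     [mydataset.events20140102],
     [mydataset.events20140103]
GROUP BY user_id
"""

job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
for row in client.query(query, job_config=job_config).result():
    print(row.user_id, row.events)
```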

To summarize:

  • More tables = less data scanned
  • More tables =(?) longer response time to the client

Can anyone shed some light on how to find the balance between cost and performance?


1 Answer


A lot depends on how you write your queries and how much development time costs, but that amount of data doesn't seem like a barrier, so you may be trying to optimize too early.

When you JOIN tables larger than 8 MB, you need to use the EACH modifier, and such a query is parallelized internally.
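A minimal sketch of such a join, with hypothetical table and column names, run through the google-cloud-bigquery Python client with legacy SQL enabled (neither the client nor the names are from the original post):

```python
from google.cloud import bigquery

client = bigquery.Client()

# EACH makes legacy BigQuery shuffle both sides of the join so the
# join itself runs in parallel across many workers, which is needed
# once the right-hand table exceeds ~8 MB.
query = """
SELECT a.user_id, b.country
FROM [mydataset.events20140101] a
JOIN EACH [mydataset.users] b
ON a.user_id = b.user_id
"""

job_config = bigquery.QueryJobConfig(use_legacy_sql=True)
rows = list(client.query(query, job_config=job_config).result())
```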

Because your data is spread across many disks, this partitioning means you can get higher effective read bandwidth by reading from many of those disks in parallel. Dremel takes advantage of this: when you run a query, it can read your data from thousands of disks at once.

Internally, BigQuery stores tables in shards; these are discrete chunks of data that can be processed in parallel. If you have a 100 GB table, it might be stored in 5,000 shards, which allows it to be processed by up to 5,000 workers in parallel. You shouldn't make any assumptions about the size or number of shards in a table; BigQuery will repartition data periodically to optimize the storage and query behavior.

Go ahead and create tables for every day. One recommendation: write your create/patch script so that each time it runs it creates tables far into the future, e.g. the next 12 months of daily tables (see the sketch below). This is better than having a script that creates one table each day, and you can make it part of your deploy/provisioning script.
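A sketch of that provisioning step, assuming the google-cloud-bigquery Python client and a made-up schema and naming scheme (none of which are from the original post); exists_ok makes it safe to re-run on every deploy:

```python
from datetime import date, timedelta

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical schema for the daily event tables.
schema = [
    bigquery.SchemaField("user_id", "STRING"),
    bigquery.SchemaField("event_ts", "TIMESTAMP"),
]

start = date.today()
for offset in range(365):  # pre-create ~12 months of daily tables
    day = start + timedelta(days=offset)
    table_id = "myproject.mydataset.events{}".format(day.strftime("%Y%m%d"))
    table = bigquery.Table(table_id, schema=schema)
    # exists_ok=True makes the script idempotent, so re-running it as
    # part of deployment is harmless.
    client.create_table(table, exists_ok=True)
```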

To read more, check out Chapter 11, "Managing Data Stored in BigQuery", from the book.