1
votes

We want to separate our data in tables with before and after processing.

The data currently resides in tables in a BigQuery dataset with name "pre".

The question now is whether to place the resulting table in a different dataset with name "post", or to keep it in the same dataset.

To consider: scripts will be joining tables from both datasets in their queries.

Is there any internal optimization in BigQuery when joining tables in one same dataset and that we will miss by splitting that in two?

For example, data locality? One might think that joining a table from a dataset in Europe with one from a dataset in the US might lead to extra effort to move data around. Could that incur extra network costs? CPU? But in our case, both datasets will be US.

Regarding access control, different data sets can be used to assign different permissions to one or another service account; but we will not make use of that.

1

1 Answers

1
votes

Is there any internal optimization in BigQuery when joining tables in one same dataset and that we will miss by splitting that in two?

I'm not aware of any optimization for tables in same dataset. Regarding locality, any given query is regional. A query cannot join a table in US with a table in EU. Once a query is validated it is guaranteed that all the tables to be accessed is in the current region.

Regarding access control, different data sets can be used to assign different permissions to one or another service account; but we will not make use of that.

Access control doesn't have impact on performance. Your query might be accessing datasets which have distinct permission settings, but as long as the acting account has all the permission needed, it doesn't matter whether the tables reside in same dataset or different ones.