0
votes

I've got 2 tables in BigQuery, already aggregated and processed, which may have up to 2 million and 10 million rows respectively.

They have very different columns, but each has the same primary key (IDXX). In table 1 there is one row for each IDXX, and in table 2 there may be up to 10 rows per IDXX.

I'd like to export these two tables from BigQuery in matching chunks. So for example:

  • table1_chunk1.csv: should have IDXX 1 - 10 (10 rows)
  • table2_chunk1.csv: should have IDXX 1 - 10 (could be up to 100 rows)
  • table1_chunk2.csv: should have IDXX 11 - 20 (10 rows)
  • table2_chunk2.csv: should have IDXX 11 - 20 (could be up to 100 rows)

What would be the best way to do this? Use Cloud Dataflow? Do it in Bash?

1
It's not clear what the expected logic is for splitting into these chunks, and what do `1 - 10 (10 rows)` and `1 - 10 (could be up to 100 rows)` mean? Please clarify. – Mikhail Berlyant

1 Answer

0
votes

It's a very broad problem, but I will try to address it purely in BigQuery.

The first thing you need is some sort of static ranking for your ids, so that retrieving a fixed set of ids is deterministic. You can create a (one-time) table that contains the rank for each id:

bq query --nouse_legacy_sql --allow_large_results --replace \
  --destination_table=dataset.ranking_table \
  "select row_number() over () as rnk, id from dataset.table1"

Once you have this static ranking_table, you can go about creating your chunks repeatedly over different ranges. For example (for 1 - 10):

For table1_chunk1:

bq query --nouse_legacy_sql --allow_large_results --replace \
  --destination_table=dataset.table1_chunk1 \
  "select a.* from dataset.table1 a join dataset.ranking_table b on b.id = a.id
   where b.rnk between 1 and 10"

For table2_chunk1:

bq query --nouse_legacy_sql --allow_large_results --replace \
  --destination_table=dataset.table2_chunk1 \
  "select a.* from dataset.table2 a join dataset.ranking_table b on b.id = a.id
   where b.rnk between 1 and 10"
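For the second pair of chunks, only the rank range and the destination table change. For example, for table1_chunk2 (rows 11 - 20; the table2_chunk2 command is identical except for the table names):

bq query --nouse_legacy_sql --allow_large_results --replace \
  --destination_table=dataset.table1_chunk2 \
  "select a.* from dataset.table1 a join dataset.ranking_table b on b.id = a.id
   where b.rnk between 11 and 20"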

And then you can export the data in these tables to a GCS bucket and download it locally. I'll leave that part to you, assuming you can figure it out.
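For completeness, a rough sketch of that last step, where gs://your-bucket is just a placeholder for a bucket you have access to:

bq extract --destination_format=CSV dataset.table1_chunk1 gs://your-bucket/table1_chunk1.csv
gsutil cp gs://your-bucket/table1_chunk1.csv .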

Hope it helps.