0
votes

I've got 2 tables in BigQuery, already aggregated and processed, which may have up to 2 million and 10 million rows respectively.

They have very different columns, but each has the same primary key (IDXX). In table 1 there is one row for each IDXX, and in table 2 there may be up to 10 rows per IDXX.

I'd like to export these two tables from BigQuery in matching chunks. So for example:

  • table1_chunk1.csv: should have IDXX 1 - 10 (10 rows)
  • table2_chunk1.csv: should have IDXX 1 - 10 (could be up to 100 rows)
  • table1_chunk2.csv: should have IDXX 11 - 20 (10 rows)
  • table2_chunk2.csv: should have IDXX 11 - 20 (could be up to 100 rows)

What would be the best way to do this? Use Cloud Dataflow? Do it in Bash?

1
It's not clear what the expected logic is for splitting into these chunks, and what do `1 - 10 (10 rows)` and `1 - 10 (could be up to 100 rows)` mean? Please clarify. – Mikhail Berlyant

1 Answer

0
votes

It's a very broad problem, but I will try to address it purely in BigQuery.

The first thing you need is some sort of static ranking for your ids, so that retrieving a fixed set of ids is deterministic. You can create a (one-time) table that contains the rank for each id:

bq query --nouse_legacy_sql --allow_large_results --replace \
  --destination_table=dataset.ranking_table \
  "select row_number() over () as rnk, id from dataset.table1"

Once you have this static ranking_table, you can go about creating your chunks repeatedly over different ranges. For example (for 1 - 10):

For table1_chunk1:

bq query --nouse_legacy_sql --allow_large_results --replace \
  --destination_table=dataset.table1_chunk1 \
  "select a.* from dataset.table1 a join dataset.ranking_table b on b.id = a.id
   where b.rnk between 1 and 10"

For table2_chunk1:

bq query --nouse_legacy_sql --allow_large_results --replace \
  --destination_table=dataset.table2_chunk1 \
  "select a.* from dataset.table2 a join dataset.ranking_table b on b.id = a.id
   where b.rnk between 1 and 10"
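For the second pair of chunks, only the rank range and the destination table change. For example, for table1_chunk2 (rows 11 - 20; the table2_chunk2 command is identical except for the table names):

bq query --nouse_legacy_sql --allow_large_results --replace \
  --destination_table=dataset.table1_chunk2 \
  "select a.* from dataset.table1 a join dataset.ranking_table b on b.id = a.id
   where b.rnk between 11 and 20"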

And then you can export the data in these tables to a GCS bucket and download it locally. I'll leave that part to you, assuming you can figure it out.
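For completeness, a rough sketch of that last step, where gs://your-bucket is just a placeholder for a bucket you have access to:

bq extract --destination_format=CSV dataset.table1_chunk1 gs://your-bucket/table1_chunk1.csv
gsutil cp gs://your-bucket/table1_chunk1.csv .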

Hope it helps.