0
votes

I have two BigQuery datasets: dataset_a and dataset_b

Each of these datasets contain a table, e.g dataset_a_table and dataset_b_table

dataset_a_table contains streaming data and I want to stream data from dataset_a_table to dataset_b_table.

I have schema of dataset_a_table of type TableSchema. How can I copy stream rows from one table to another and keep the existing schema?

I have so far looked at insertAll method of BigQuery but I am a bit unsure about which data structure to fetch rows in and how to specify TableSchema when inserting into a new table.

I would appreciate some guidance regarding how to do that. Thanks.

1

1 Answers

1
votes

Approach 1: If dataset_b_table needs to simply mirror dataset_a_table, for instance because you have different user permissions on the two datasets, you could consider setting up dataset_b_table as a view instead of a table. Views in BigQuery work across datasets:

CREATE VIEW dataset_b.dataset_b_view AS SELECT * FROM dataset_a.dataset_a_table

Approach 2: If you do want dataset_b_table with the same schema as dataset_a_table, you can use the BigQuery native "transfers" functionality. ("Transfers" > "Create Transfer" > select "Dataset Copy")

Approach 3: If dataset_b_table has a different schema from dataset_a_table, or if dataset_b_table already contains data and you want to merge in data from dataset_a_table, you will need some sort of incremental logic. Assuming your dataset_a_table has some sort of "created_at" field (also assuming no updates to records) then you could go with an incremental load like this:

INSERT INTO dataset_b.dataset_b_table
SELECT
   column_a, column_b, column_c, updated_at
FROM dataset_a.dataset_a_table
WHERE updated_at>(SELECT max(updated_at) FROM dataset_b.dataset_b_table)

You can then schedule this to run depending on your timing requirements, once a day, hour, or every couple of minutes. You can use the BigQuery native scheduling functionality, or your own logic.

If you need actual streaming in (milli)seconds, and the View approach doesn't work for you, you will need to work with the source that fills dataset_a_table in the first place as BigQuery doesn't support triggers.