0 votes

Presently, we send entire files to Google Cloud Storage, import them into BigQuery, and do a simple drop/replace. However, as the file sizes have grown, our network team isn't happy with the bandwidth we consume while other ETLs are also trying to run. As a result, we are looking into sending only the changed/deleted rows.

I'm trying to find documentation or guidance on how to do this. Scope: I will start with a simple example. We have a large table with 300 million records. Rather than sending all 300 million records every night, we would send over only the X million that have changed or been deleted, and I then need to incorporate those changed/deleted records into the BigQuery tables.

We presently use Node.js to move data from Cloud Storage to BigQuery and Python via Composer to schedule native table updates in BigQuery.

I hope to get pointed in the right direction on how to start down this path.


1 Answer

1 vote

- Stream the full row on every update to BigQuery.
- Let the table accommodate multiple rows for the same primary entity.
- Write a view, e.g. table_last, that picks the most recent row per entity (see the sketch below).
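A minimal sketch of the streaming-plus-view pattern, assuming a hypothetical table mydataset.events with an id key and an updated_at timestamp (all table, view, and column names are illustrative, not from the question):

```python
# Sketch only: stream full rows, then expose the latest row per id via a view.
from google.cloud import bigquery

client = bigquery.Client()

# 1) Stream the full row on every update; duplicate rows per id are expected.
errors = client.insert_rows_json(
    "mydataset.events",
    [{"id": 42, "payload": "new value", "updated_at": "2024-01-01T00:00:00Z"}],
)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")

# 2) A view that picks only the most recent row per id (the "table_last" idea).
client.query("""
    CREATE OR REPLACE VIEW mydataset.table_last AS
    SELECT * EXCEPT(rn)
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
      FROM mydataset.events
    )
    WHERE rn = 1
""").result()
```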

This way all your queries run near-realtime on real data. You can occasionally deduplicate the table by running a query that rewrites the table with only the latest row per entity, as sketched below.
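One way to do that occasional deduplication, under the same assumed schema, is to overwrite the table with the result of a latest-row-per-id query (project and dataset names are placeholders):

```python
# Sketch of a periodic de-duplication pass over the assumed mydataset.events table.
from google.cloud import bigquery

client = bigquery.Client()

dedup_sql = """
    SELECT * EXCEPT(rn)
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
      FROM mydataset.events
    )
    WHERE rn = 1
"""

job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my-project.mydataset.events"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Caution: rows still sitting in the streaming buffer may be lost on truncate,
# so schedule this for a quiet window well after the last streamed insert.
client.query(dedup_sql, job_config=job_config).result()
```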

Another approach is to keep one final table and one staging table that you stream into, and have a MERGE statement scheduled to run every X minutes to apply the updates from the streamed table to the final table.
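A hedged sketch of such a MERGE, assuming the changed/deleted rows land in a hypothetical staging table mydataset.events_staging that carries an is_deleted flag (again, all names are illustrative):

```python
# Sketch of a scheduled MERGE from a staging table into the final table.
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    MERGE mydataset.events AS final
    USING mydataset.events_staging AS staged
    ON final.id = staged.id
    WHEN MATCHED AND staged.is_deleted THEN
      DELETE
    WHEN MATCHED THEN
      UPDATE SET payload = staged.payload, updated_at = staged.updated_at
    WHEN NOT MATCHED AND NOT staged.is_deleted THEN
      INSERT (id, payload, updated_at)
      VALUES (staged.id, staged.payload, staged.updated_at)
""").result()
```

In Composer this query could be wrapped in a BigQueryInsertJobOperator (or a PythonOperator calling the client as above) and scheduled every X minutes; after a successful merge the staging table can be truncated so only new changes accumulate for the next run.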