Presently, we upload entire files to Google Cloud Storage, import them into BigQuery, and do a simple drop/replace of the table. However, as the file sizes have grown, our network team has become unhappy with the bandwidth we consume while other ETLs are also trying to run. As a result, we are looking into sending up only the changed and deleted rows.
I am trying to find the right approach, or the help docs, for doing this. To scope it, I will start with a simple example. We have a large table with 300 million records. Rather than sending all 300 million records every night, we would send only the X million that have changed or been deleted. I then need to apply those changed and deleted records to the existing BigQuery tables.
We presently use Node.js to load data from Cloud Storage into BigQuery, and Python via Composer to schedule native table updates in BigQuery.
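To make the apply step concrete, here is a rough sketch of one possible shape for it, not something we have in place. It assumes the nightly delta is first loaded into a staging table, that each row has a unique key, and that deleted rows are marked with a boolean flag column. The function name `build_merge_sql`, the table names, the column names, and the `is_deleted` flag are all hypothetical. The function only builds a BigQuery `MERGE` statement as a string; running it would be a separate step.

```python
def build_merge_sql(target, staging, key_cols, data_cols, deleted_flag="is_deleted"):
    """Build a BigQuery MERGE statement that applies a delta staging table to a target.

    Assumes the staging table holds at most one row per key and carries a
    boolean column (deleted_flag) that marks rows removed at the source.
    """
    on_clause = " AND ".join(f"T.{c} = S.{c}" for c in key_cols)
    set_clause = ", ".join(f"{c} = S.{c}" for c in data_cols)
    all_cols = list(key_cols) + list(data_cols)
    insert_cols = ", ".join(all_cols)
    insert_vals = ", ".join(f"S.{c}" for c in all_cols)
    return (
        f"MERGE `{target}` AS T\n"
        f"USING `{staging}` AS S\n"
        f"ON {on_clause}\n"
        # Rows flagged as deleted at the source are removed from the target.
        f"WHEN MATCHED AND S.{deleted_flag} THEN DELETE\n"
        # Remaining matched rows are treated as changed and overwritten.
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        # Unmatched, non-deleted rows are new and get inserted.
        f"WHEN NOT MATCHED AND NOT S.{deleted_flag} THEN "
        f"INSERT ({insert_cols}) VALUES ({insert_vals})"
    )


# Hypothetical table and column names, for illustration only.
sql = build_merge_sql(
    target="my_project.my_dataset.big_table",
    staging="my_project.my_dataset.big_table_delta",
    key_cols=["id"],
    data_cols=["name", "amount"],
)
print(sql)
```

The resulting string could presumably be submitted from the existing Python scheduling code, for example with `google.cloud.bigquery.Client().query(sql).result()`. If the staging table can contain more than one row per key, it would need to be deduplicated first, since `MERGE` fails when several source rows match the same target row.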
I am hoping to be pointed in the right direction on how to start down this path.