For my use case I'm working with data identifiable by unique key at the source exploded into n (non deterministic) number of target entries loaded into BigQuery tables for analytic purposes.
Building this ETL to use Mongo recent Change Stream feature I would like to drop all entries in BigQuery and then load the new entries atomically.
Exploring BigQuery DML I see a MERGE operation is supported, but only WHEN MATCHED THEN DELETE or WHEN MATCHED THEN UPDATE is possible.
I'm interested in a WHEN MATCHED THEN DELETE, AND FOLLOW BY AN INSERT operation.
How would I implement such ETL in BigQuery while remaining atomic or eventually consistent as possible in terms of data availability and correctness.
EDIT 1: I would like to provide a concrete example to elaborate.
The lowest granularity of uniqueness I have on this dataset is user_id. Rows are not uniquely identifiable.
Example
1.
Updated user object received from mongo change stream:
user={_id: "3", name="max", registered="2018-07-05" q=["a", "b", "c"]}
2.
Current BigQuery.user_q holds
| user_id | q |
...
| 3 | a |
| 3 | b |
...
3.
Transform code loads modified user object into BigQuery.user_q_incoming
| user_id | q |
| 3 | a |
| 3 | b |
| 3 | c |
4.
MERGE between user_q and user_q_incoming:
- 2 rows in
user_qthat belong touser_id 3are DELETED - 3 rows in
user_q_incomingthat belong touser_id 3are INSERTED. - Rest of the data (
...) inuser_qis left in place, unmodified.
5.
BigQuery.user_q holds
| user_id | q |
...
| 3 | a |
| 3 | b |
| 3 | c |
...
For example user might delete a question from his profile. Leaving the remaining rows to be q=["a", "c"]. I need this to translate into the BigQuery world view as well.