1
votes

I need to delete rows that match a given WHERE condition from a partitioned BigQuery table. The table always has a streaming buffer adding more data to it. I am not concerned with deleting from what is being streamed in, just historical data in yesterday's partition.

What is the correct strategy to remove data while a streaming buffer exists on a BigQuery table, ideally without downtime?

2
See also another answer from @AlexeyMaloletkin in case you want to drop old partitions from ingestion-time partitioned tables in their entirety. - Lefteris S
@mark, found the solution? i need to delete old duplicated rows which are not in streaming buffer too - datdinhquoc

2 Answers

1
votes

You can use a Data Manipulation Language DELETE statement. However keep in mind the following (from DML docs):

Support for using Data Manipulation Language statements to modify partitioned table data is currently in Beta.

But you can always do select with filtering out records you want to delete and write the results back to the same partition. There will be no downtime. Cost will be the same as the cost of a full single partition scan.

1
votes

From this page Data Manipulation Language

"Rows that were written to a table recently via streaming (using the tabledata.insertall method) cannot be modified using UPDATE, DELETE, or MERGE statements. Recent writes are typically those that occur within the last 30 minutes. Note that all other rows in the table remain modifiable by using UPDATE, DELETE, or MERGE statements."

This means that you should restrict your DML with a time, ideally you should have date created column so you can use that OR if there is a built in metadata column that you can use but I am not aware of such column