0
votes

In Google BigQuery, I would like to delete a subset of records, based on the value of a specific column. It's a query that I need to run repeatedly and that I would like to run automatically.

The problem is that this specific column is of the form STRUCT<column_1 ARRAY (STRING), column_2 ARRAY (STRING), ... >, and I don't know how to use such a column in the where-clause when using the delete-command.

Here is basically what I am trying to do (this code does not work):

DELETE
FROM dataset.table t
LEFT JOIN UNNEST(t.category.column_1) AS type
WHERE t.partition_date = '2020-07-22'
  AND type = 'some_value'

The error that I'm getting is: Syntax error: Expected end of input but got keyword LEFT at [3:1]

If I replace the DELETE with SELECT *, it does work:

SELECT *
FROM dataset.table t
LEFT JOIN UNNEST(t.category.column_1) AS type
WHERE t.partition_date = '2020-07-22'
  AND type = 'some_value'

Does somebody know how to use such a column to delete a subset of records?

EDIT:

Here is some code to create a reproducible example with some silly data (fill in your own dataset and table name in all queries):

Suppose you want to delete all rows where category.type contains the value 'food'.

1 - create a table:

CREATE TABLE <DATASET>.<TABLE_NAME>
(
  article STRING,
  category STRUCT<
    color STRING,
    type ARRAY<STRING>
  >
);

2 - Insert data into the new table:

INSERT  <DATASET>.<TABLE_NAME>
SELECT "apple" AS article, STRUCT('red' AS color, ['fruit','food'] as type) AS category
UNION ALL
SELECT "cabbage" AS article, STRUCT('blue' AS color, ['vegetable', 'food'] as type) AS category
UNION ALL
SELECT "book" AS article, STRUCT('red' AS color, ['object'] as type) AS category
UNION ALL
SELECT "dog" AS article, STRUCT('green' AS color, ['animal', 'pet'] as type) AS category;

3 - Show that select works (return all rows where category.type contains the value 'food'; these are the rows I want to delete):

SELECT *
FROM <DATASET>.<TABLE_NAME>
LEFT JOIN UNNEST(category.type) type
WHERE type = 'food'

Initial Result

4 - My attempt at deleting rows where category.type contains 'food' does not work:

DELETE
FROM <DATASET>.<TABLE_NAME>
LEFT JOIN UNNEST(category.type) type
WHERE type = 'food'

Syntax error: Unexpected keyword LEFT at [3:1]

Desired Result

2
In order to investigate further and make some tests, can you provide sample data and desired output?Alexandre Moraes

2 Answers

0
votes

Since you did not provide any sample data I am going to explain using some dummy data. In case you add your sample data, I can update the answer.

Firstly,according to your description, you have only a STRUCT not an Array[Struct <col_1, col_2>].For this reason, you do not need to use UNNEST to access the values within the data. Below is an example how to access particular data within a STRUCT.

WITH data AS (
SELECT 1 AS id, STRUCT("Alex" AS name, 30 AS age, "NYC" AS city) AS info UNION ALL 
SELECT 1 AS id, STRUCT("Leo" AS name, 18 AS age, "Sydney" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Robert" AS name, 25 AS age, "Paris" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Mary" AS name, 28 AS age, "London" AS city) AS info UNION ALL
SELECT 1 AS id, STRUCT("Ralph" AS name, 45 AS age, "London" AS city) AS info 
)
SELECT * FROM data
WHERE info.city = "London"

Notice that the STRUCT is named info and the data we accessed is city and used it in the WHERE clause.

Now, in order to delete the rows that contains an specific value within the STRUCT , in your case I assume it would be your_struct.column_1, you can use DELETE or MERGE and DELETE. I have saved the above data in a table to execute the below examples, which have the same output,

First method: DELETE

DELETE FROM `project.dataset.table`
WHERE info.city = "Sydney"

Second method: MERGE and DELETE

MERGE `project.dataset.table` a
USING (SELECT * from `project.dataset.table` WHERE info.city ="London") b
ON a.info.city =b.info.city
WHEN matched and b.id=1 then
Delete

And the output for both queries,

Row id  info.name   info.age    info.city
1   1   Alex        30          NYC
2   1   Robert      25          Paris
3   1   Ralph       45          London
4   1   Mary        28          London

As you can see the row where info.city = "Sydney" was deleted in both cases.

It is important to point out that your data is excluded from your source table. Therefore, you should be careful.

Note: Since you want to run this process everyday, you could use Schedule Query within BigQuery Console, appending or overwriting the results after each run. Also, it is a good practice not deleting data from your source table. Thus, consider creating a new table from your source table without the rows you do not desire.

0
votes

This is the code I used to delete the desired records (the records where category.type contains the value 'food'.)

DELETE
FROM <DATASET>.<TABLE_NAME> t1
WHERE EXISTS(SELECT 1 FROM UNNEST(t1.category.type) t2 WHERE t2 = 'food')

The embarrasing thing is that I've seen these kind of answers on similar questions (for example on update-queries). But I come from Oracle-SQL and I think that there you are required to connect your subquery with your main query in the WHERE-statement of the subquery (ie. connect t1 with t2), so I didn't understand these answers. That's why I posted this question.

However, I learned that BigQuery automatically understands how to connect table t1 and 'table' t2; you don't have to explicitly connect them.

Now it is possible to still do this (perhaps even recommended?):

DELETE
FROM <DATASET>.<TABLE_NAME> t1
WHERE EXISTS (SELECT 1 FROM <DATASET>.<TABLE_NAME> t2 LEFT JOIN UNNEST(t2.category.type) AS type WHERE type = 'food' AND t1.article=t2.article)

but a second difficulty for me was that my ID in my actual data is somehow hidden in an array>struct-construction, so I got stuck connecting t1 & t2. Fortunately this is not always an absolute necessity.