BigQuery: How do I add a field to a REPEATED record?

Question

I've got a table in Google BigQuery that consists of a few fields, then a REPEATED record which may contain one or more objects. I want to create a new table with an extra field in the REPEATED data, and copy my original data into the new table, populating the new field with the output of GENERATE_UUID() so there is one unique identifier per REPEATED line of data.

I had a similar question at How do I copy from one BigQuery Table to another when the target contains REPEATED fields? but I don't know how to adapt this to fit my current use case.

Here's my "new" Schema 1 (ie Schema 2 from the above link)

[
    {"name": "id", "type": "NUMERIC", "mode": "REQUIRED"},
    {"name": "name", "type": "STRING", "mode": "REQUIRED"},
    {"name": "created", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "valid", "type": "BOOLEAN", "mode": "REQUIRED"},
    {"name": "parameters", "type": "RECORD", "mode": "REPEATED", "fields":
        [
            {"name": "parameter1", "type": "STRING", "mode": "REQUIRED"},
            {"name": "parameter2", "type": "FLOAT", "mode": "REQUIRED"},
            {"name": "parameter3", "type": "BOOLEAN", "mode": "REQUIRED"}
        ]
    }
]

and I'd like it to end up like this, Schema 2:

[
    {"name": "id", "type": "NUMERIC", "mode": "REQUIRED"},
    {"name": "name", "type": "STRING", "mode": "REQUIRED"},
    {"name": "created", "type": "TIMESTAMP", "mode": "REQUIRED"},
    {"name": "valid", "type": "BOOLEAN", "mode": "REQUIRED"},
    {"name": "parameters", "type": "RECORD", "mode": "REPEATED", "fields":
        [
            {"name": "uuid", "type": "STRING", "mode": "REQUIRED"},
            {"name": "parameter1", "type": "STRING", "mode": "REQUIRED"},
            {"name": "parameter2", "type": "FLOAT", "mode": "REQUIRED"},
            {"name": "parameter3", "type": "BOOLEAN", "mode": "REQUIRED"}
        ]
    }
]

So I've got my new table (Table 2) created with this Schema. I want to copy from Table 1, and I'm trying something like this:

insert into table2_with_uuid(id, name, created, valid, parameters)
select id, name, created, valid,
[(
GENERATE_UUID(), parameters.parameter1, parameters.parameter2, parameters.parameter3
)]
from table1_no_guid;

This gives me an error saying: Cannot access field ceId on a value with type ARRAY<STRUCT<parameter1 (etc)

Does anyone have any suggestions as to how to proceed? Thanks!

aemon4 aemon4 · Accepted Answer · 2020-04-07T12:35:51

I have followed the procedure in Data Manipulation Language syntax on the official documentation.

Then, basically what you want is to update repeated records. I have followed all the examples, from the inserts to the updates up to the moment in which a second comment is added to the repeated record.

Then I applied the UNNEST query:

insert into `testing.followingDMLmod`  (product, quantity, supply_constrained, comments)
select product, quantity, supply_constrained,
[(
GENERATE_UUID(), com.created, com.comment
)]
from `testing.followingDML` , UNNEST(comments) com;

which of course works but does not provide the desired result.

As per the official documentation "BigQuery natively supports several schema changes such as adding a new nested field to a record or relaxing a nested field's mode." Then, perhaps the path is copying the table and afterwards adding the extra field.

That can be done following managing table schemas documentation. That is, either using the API and calling tables.patch, which was discussed with more detail in this other stack overflow post, or using a JSON file with the schema from the command line.

I have personally followed the second approach (JSON schema file) and worked perfectly for me. With more detail the steps I followed are (as found in here):

Use Copy table in the BigQuery UI to get a replica of your table without "id". My starting table is followingDML and the copy followingDMLmod.
Copy the schema from your table into a JSON file (here called myschema.json) by running the following command in the Cloud Shell

bq show \
--schema \
--format=prettyjson \
testing.followingDMLmod > myschema.json

Open the schema in a text editor. For example running

vim myschema.json

Now modify the schema to add the new nested column to the end of the fields array. (If you have never used vim, a very simplified explanation would be "esc" returns you to normal mode, and while in normal mode clicking "i" allows you to write into the opened file, ":w" saves the file and ":q" exits the file) I included the field "id":

{
  "mode": "NULLABLE",
  "name": "id",
  "type": "STRING"
}

Now you need to update the schema by running

bq update testing.followingDMLmod myschema.json

Finally, back on the BigQuery UI I used the query

UPDATE `testing.followingDMLmod` 
SET comments = ARRAY(
    SELECT AS STRUCT * REPLACE(GENERATE_UUID() AS id)
    FROM UNNEST(comments)
  )
WHERE true

to populate the id field after creation. Following what is suggested in this stack overflow post. Now the end result is truly what it was expected!

BigQuery: How do I add a field to a REPEATED record?

3 Answers