0
votes

In BigQuery I want to update a column in one of my tables and it seems that doing so costs as much as if I did an FTS (full table scan) on the table.

let's say I have a 1TB table employee having two columns: first_name, last_name (512GB each) and I want to set the firs_name of all the employees to 'john', like below

UPDATE my_dataset.employee
SET first_name = 'john'
WHERE TRUE IS TRUE          # mandatory WHERE clause 

How much does this cost ?

from this documentation link (relevant part below)

DML statement 1: Updating a single partition

mytable2 has two columns: id of type INTEGER and ts of type TIMESTAMP. mytable has two columns: field1 of type INTEGER and field2 of type STRING.

UPDATE project.mydataset.mytable T 
SET T.field1 = T.field1 + 100 
WHERE T._PARTITIONTIME = TIMESTAMP(“2017-05-01”) 
      AND EXISTS (SELECT S.id 
                  FROM project.mydataset.mytable2 S 
                  WHERE S.id = T.field1)

Bytes processed in this example =

sum of the number of bytes in mytable2.id +
sum of the number of bytes in mytable.field1 in the "2017-05-01" partition +
sum of the number of bytes in mytable.field2 in the "2017-05-01" partition

We have that updating one column project.mydataset.mytable for one partition costs as much as reading all the columns (SELECT *) for the same partition. Why is that ?

I've tried the same query both in the UI and the CLI, the price (bytes processed) doesn't change it's equivalent to a (SELECT *) for the same partitions

bq query --dry_run '<my query>'
1

1 Answers

1
votes

Referring to the BigQuery Pricing page:

UPDATE: The sum of bytes processed for all the columns referenced in all partitions for the tables scanned by the query + the sum of bytes for all columns in the updated or scanned partitions for the table being updated (at the time the UPDATE starts).

The second part of the statement says that the user will be billed for the sum of the bytes for all columns in the updated or scanned partitions for the table being updated so it will bill for the size of the table.

Let's take a look for an example in the documentation, it says the bytes billed will include the sum of the bytes in mytable.field2 which is not included in the query. The calculation of the bytes uses specific columns for other tables but bills on the whole table's size, which is being updated (even columns that are not included in the query).

I hope you find the above pieces of information useful.