0
votes

I want to update a column in a large table (tens of millions of rows), without bringing our application down; i.e. there will be concurrent writes to the table. I want to do this in code (Java) since the update is non-trivial, and am wondering what the best way to do it is.

An approach that seems efficient is to open a read-only transaction querying all rows, then loop over the result set (resultSet.next()) while at the same time creating a series of read/write transactions, each with, say, 10,000 INSERT_OR_UPDATE mutations buffered, containing the desired updates.
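Roughly what I have in mind (a sketch assuming the Cloud Spanner Java client; SomeTable, Key, Col and the transform() helper are placeholders for my actual schema and update logic):

import com.google.cloud.spanner.*;
import java.util.ArrayList;
import java.util.List;

Spanner spanner = SpannerOptions.newBuilder().build().getService();
DatabaseClient client = spanner.getDatabaseClient(
    DatabaseId.of("my-project", "my-instance", "my-database"));

List<Mutation> batch = new ArrayList<>();
try (ReadOnlyTransaction tx = client.readOnlyTransaction();
     ResultSet rs = tx.executeQuery(Statement.of("SELECT Key, Col FROM SomeTable"))) {
  while (rs.next()) {
    batch.add(Mutation.newInsertOrUpdateBuilder("SomeTable")
        .set("Key").to(rs.getLong("Key"))
        .set("Col").to(transform(rs.getString("Col"))) // the non-trivial update
        .build());
    if (batch.size() == 10_000) {
      client.write(batch); // one read/write transaction per batch of mutations
      batch = new ArrayList<>();
    }
  }
}
if (!batch.isEmpty()) {
  client.write(batch);
}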

The problem is that this doesn't handle concurrent writes, since these steps can happen:

  1. The read-only transaction mentioned above reads row X
  2. Some separate transaction updates row X
  3. The read/write transaction mentioned above uses data from step 1 and overwrites the update from step 2

To fix this, I could read the value back during the read/write transaction and verify that it hasn't changed, similar to this example; however, this seems very slow (~50 ms per call, which translates to several weeks to update the whole table).

So how can I do this more efficiently? Thanks!

2
You write that you could read the value back during the read/write transaction and verify that it hasn't changed. What would you do if it has changed? Use the changed value in the update? Or abort the entire update? The latter is no longer possible, as the update is broken into several transactions. If you intend to just use the updated data, then you should follow a different strategy: start a read/write transaction and read the first 10,000 rows you want to update, update these in the read/write tx and commit. Repeat with the following 10,000 rows until all rows are done. – Knut Olav Løite

Thanks for the response! Just to clarify your suggestion: do you mean to call executeQuery in the r/w transaction, loop over the result set, and create and buffer all mutations? That indeed seems to work, and is efficient! And as for pagination, I guess I would have to use the where / order by / limit technique described in this answer? – John

Yes, that's what I meant. I'll add this as an answer as well, so that others with a similar question will find this thread more easily. – Knut Olav Løite

2 Answers

2
votes

The best way to do this is not to use a read-only transaction, but to start a read/write transaction for each batch of 10,000 records: read the values you want to update in this read/write transaction, then update those 10,000 records in the same transaction. Repeat until all records have been updated.

So something like this:

  1. Start read/write transaction
  2. Read a batch of 10,000 records. Make sure that the order of the records is consistent by ordering on the primary key or some other unique (combination of) column(s). Page on the key and cap each batch with LIMIT (avoid OFFSET, which gets slower the deeper you page), so you get a query like SELECT * FROM SOME_TABLE WHERE KEY > @start ORDER BY KEY LIMIT 10000 (see the sketch after this list)
  3. Update the records and commit the transaction.
  4. Repeat until all records have been updated.
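A minimal sketch of that loop, assuming the Cloud Spanner Java client, a DatabaseClient named client, and an INT64 primary key; SomeTable, Key, Col and transform() are placeholders for your actual schema and update logic:

long lastKey = Long.MIN_VALUE;
while (true) {
  final long startKey = lastKey;
  Long maxKeyInBatch = client.readWriteTransaction().run(transaction -> {
    Long maxKey = null;
    try (ResultSet rs = transaction.executeQuery(
        Statement.newBuilder(
                "SELECT Key, Col FROM SomeTable WHERE Key > @start ORDER BY Key LIMIT 10000")
            .bind("start").to(startKey)
            .build())) {
      while (rs.next()) {
        // Read and update in the SAME transaction, so a concurrent write to
        // one of these rows aborts this transaction instead of being lost.
        transaction.buffer(Mutation.newUpdateBuilder("SomeTable")
            .set("Key").to(rs.getLong("Key"))
            .set("Col").to(transform(rs.getString("Col")))
            .build());
        maxKey = rs.getLong("Key");
      }
    }
    return maxKey;
  });
  if (maxKeyInBatch == null) {
    break; // no more rows to update
  }
  lastKey = maxKeyInBatch;
}

Because the rows are read and updated inside the same read/write transaction, a conflicting concurrent write aborts the transaction, and the client library retries the whole callable, re-reading the current values before buffering the mutations again.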
0
votes

You don't say how you want to update the column - whether it is setting a constant value, or calculating a value based on other column(s) in the row.

You also don't say what the other updates to the table are, but I assume that they will involve modifications to this column, or modifications to other columns that affect this column.

Either way, partitioned DML is a solution to this. Express your modification in the form of an UPDATE statement:

UPDATE table SET col1=123 WHERE col2=TRUE

Then run this as partitioned DML (using the API, or gcloud with the --enable-partitioned-dml flag). Spanner will split the operation up into multiple separate transactions, each of which is internally consistent. Each of these transactions only locks a subset of the rows in the table while it is running.
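With the Java client, for example, this is a single call (a sketch; client is assumed to be a DatabaseClient):

// Sketch: run the statement above as partitioned DML.
// executePartitionedUpdate returns a lower bound on the number of rows updated.
long lowerBoundRowCount = client.executePartitionedUpdate(
    Statement.of("UPDATE table SET col1=123 WHERE col2=TRUE"));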

One issue with partitioned DML is that the statement can be applied more than once to the same row - due to retries - so the statement must be idempotent, i.e. it must give the same result when executed multiple times.
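For example (illustrative only; the relative form is the kind of statement to avoid here):

// Idempotent: an absolute assignment yields the same end state however
// many times it runs against a row.
client.executePartitionedUpdate(
    Statement.of("UPDATE table SET col1=123 WHERE col2=TRUE"));

// NOT idempotent: a relative update could be applied twice to the same
// row on a retry, doubling the increment.
client.executePartitionedUpdate(
    Statement.of("UPDATE table SET col1=col1+1 WHERE col2=TRUE"));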