0
votes

I want to update a column in a large table (tens of millions of rows), without bringing our application down; i.e. there will be concurrent writes to the table. I want to do this in code (Java) since the update is non-trivial, and am wondering what the best way to do it is.

An approach that seems efficient is to open a read-only transaction querying all rows, then loop over the result set (resultSet.next()) while at the same time creating a series of read/write transactions, each with, say, 10,000 INSERT_OR_UPDATE mutations buffered, containing the desired updates.
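Roughly what I have in mind (a sketch assuming the Cloud Spanner Java client; SomeTable, Key, Col and the transform() helper are placeholders for my actual schema and update logic):

import com.google.cloud.spanner.*;
import java.util.ArrayList;
import java.util.List;

Spanner spanner = SpannerOptions.newBuilder().build().getService();
DatabaseClient client = spanner.getDatabaseClient(
    DatabaseId.of("my-project", "my-instance", "my-database"));

List<Mutation> batch = new ArrayList<>();
try (ReadOnlyTransaction tx = client.readOnlyTransaction();
     ResultSet rs = tx.executeQuery(Statement.of("SELECT Key, Col FROM SomeTable"))) {
  while (rs.next()) {
    batch.add(Mutation.newInsertOrUpdateBuilder("SomeTable")
        .set("Key").to(rs.getLong("Key"))
        .set("Col").to(transform(rs.getString("Col"))) // the non-trivial update
        .build());
    if (batch.size() == 10_000) {
      client.write(batch); // one read/write transaction per batch of mutations
      batch = new ArrayList<>();
    }
  }
}
if (!batch.isEmpty()) {
  client.write(batch);
}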

The problem is that this doesn't handle concurrent writes, since these steps can happen:

  1. The read-only transaction mentioned above reads row X
  2. Some separate transaction updates row X
  3. The read/write transaction mentioned above uses data from step 1 and overwrites the update from step 2

To fix this, I could read the value back during the read/write transaction and verify that it hasn't changed, similar to this example; however, this seems very slow (~50 ms per call, which translates to several weeks to update the whole table).

So how can I do this more efficiently? Thanks!

2
You write that you could read the value back during the read/write transaction and verify that it hasn't changed. What would you do if it has changed? Use the changed value in the update? Or abort the entire update? The latter is no longer possible, as the update is broken into several transactions. If you intend to just use the updated data, then you should follow a different strategy: start a read/write transaction and read the first 10,000 rows you want to update, update these in the read/write tx and commit. Repeat with the following 10,000 rows until all rows are done. – Knut Olav Løite

Thanks for the response! Just to clarify your suggestion: do you mean to call executeQuery in the r/w transaction, loop over the result set, and create and buffer all mutations? That indeed seems to work, and is efficient! And as for pagination, I guess I would have to use the where / order by / limit technique described in this answer? – John

Yes, that's what I meant. I'll add this as an answer as well, so that others with a similar question will find this thread more easily. – Knut Olav Løite

2 Answers

2
votes

The best way to do this is not to use a read-only transaction, but to start a read/write transaction for each batch of 10,000 records: read the values you want to update in this read/write transaction, then update those 10,000 records in the same transaction. Repeat until all records have been updated.

So something like this:

  1. Start read/write transaction
  2. Read a batch of 10,000 records. Make sure that the order of the records is consistent by ordering on the primary key or some other unique (combination of) column(s). Page on the key and cap each batch with LIMIT (avoid OFFSET, which gets slower the deeper you page), so you get a query like SELECT * FROM SOME_TABLE WHERE KEY > @start ORDER BY KEY LIMIT 10000 (see the sketch after this list)
  3. Update the records and commit the transaction.
  4. Repeat until all records have been updated.
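A minimal sketch of that loop, assuming the Cloud Spanner Java client, a DatabaseClient named client, and an INT64 primary key; SomeTable, Key, Col and transform() are placeholders for your actual schema and update logic:

long lastKey = Long.MIN_VALUE;
while (true) {
  final long startKey = lastKey;
  Long maxKeyInBatch = client.readWriteTransaction().run(transaction -> {
    Long maxKey = null;
    try (ResultSet rs = transaction.executeQuery(
        Statement.newBuilder(
                "SELECT Key, Col FROM SomeTable WHERE Key > @start ORDER BY Key LIMIT 10000")
            .bind("start").to(startKey)
            .build())) {
      while (rs.next()) {
        // Read and update in the SAME transaction, so a concurrent write to
        // one of these rows aborts this transaction instead of being lost.
        transaction.buffer(Mutation.newUpdateBuilder("SomeTable")
            .set("Key").to(rs.getLong("Key"))
            .set("Col").to(transform(rs.getString("Col")))
            .build());
        maxKey = rs.getLong("Key");
      }
    }
    return maxKey;
  });
  if (maxKeyInBatch == null) {
    break; // no more rows to update
  }
  lastKey = maxKeyInBatch;
}

Because the rows are read and updated inside the same read/write transaction, a conflicting concurrent write aborts the transaction, and the client library retries the whole callable, re-reading the current values before buffering the mutations again.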
0
votes

You don't say how you want to update the column - whether it is setting a constant value, or calculating a value based on other column(s) in the row.

You also don't say what the other updates to the table are, but I assume that they will involve modifications to this column, or modifications to other columns that affect this column.

Either way, partitioned DML is a solution to this. Express your modification in the form of an UPDATE statement:

UPDATE table SET col1=123 WHERE col2=TRUE

Then run this as partitioned DML (using the API, or gcloud with the --enable-partitioned-dml flag). Spanner will split the operation up into multiple separate transactions, each of which is internally consistent. Each of these transactions only locks a subset of the rows in the table while it is running.
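With the Java client, for example, this is a single call (a sketch; client is assumed to be a DatabaseClient):

// Sketch: run the statement above as partitioned DML.
// executePartitionedUpdate returns a lower bound on the number of rows updated.
long lowerBoundRowCount = client.executePartitionedUpdate(
    Statement.of("UPDATE table SET col1=123 WHERE col2=TRUE"));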

One issue with partitioned DML is that the statement can be applied more than once to the same row - due to retries - so the statement must be idempotent, i.e. it must give the same result when executed multiple times.
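For example (illustrative only; the relative form is the kind of statement to avoid here):

// Idempotent: an absolute assignment yields the same end state however
// many times it runs against a row.
client.executePartitionedUpdate(
    Statement.of("UPDATE table SET col1=123 WHERE col2=TRUE"));

// NOT idempotent: a relative update could be applied twice to the same
// row on a retry, doubling the increment.
client.executePartitionedUpdate(
    Statement.of("UPDATE table SET col1=col1+1 WHERE col2=TRUE"));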