I want to update a column in a large table (tens of millions of rows), without bringing our application down; i.e. there will be concurrent writes to the table. I want to do this in code (Java) since the update is non-trivial, and am wondering what the best way to do it is.
An efficient approach is to open a read-only transaction querying all rows, loop over the results (`resultSet.next()`), and at the same time create a series of read/write transactions, each with, say, 10,000 buffered `INSERT_OR_UPDATE` mutations containing the desired updates.
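The transaction and mutation terminology suggests the Cloud Spanner Java client, so here is a minimal sketch of that approach under that assumption; the table (`Items`), columns (`Id`, `Value`), and the `transform` helper are placeholders for the real schema and update logic:

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.ReadOnlyTransaction;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;
import java.util.ArrayList;
import java.util.List;

public class BulkUpdate {
  private static final int BATCH_SIZE = 10_000;

  // Table and column names ("Items", "Id", "Value") are placeholders.
  static void updateAll(DatabaseClient dbClient) {
    List<Mutation> batch = new ArrayList<>(BATCH_SIZE);
    try (ReadOnlyTransaction tx = dbClient.readOnlyTransaction();
         ResultSet rs = tx.executeQuery(Statement.of("SELECT Id, Value FROM Items"))) {
      while (rs.next()) {
        batch.add(Mutation.newInsertOrUpdateBuilder("Items")
            .set("Id").to(rs.getLong("Id"))
            .set("Value").to(transform(rs.getString("Value"))) // the non-trivial update
            .build());
        if (batch.size() == BATCH_SIZE) {
          dbClient.write(batch); // each call commits one batch in its own transaction
          batch = new ArrayList<>(BATCH_SIZE);
        }
      }
    }
    if (!batch.isEmpty()) {
      dbClient.write(batch);
    }
  }

  static String transform(String oldValue) {
    return oldValue; // placeholder for the real update logic
  }
}
```

Note that each `dbClient.write(...)` call commits its batch as a separate transaction, decoupled from the read-only transaction the values came from; that decoupling is exactly what opens the race described next.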
The problem is that this doesn't handle concurrent writes, since the following sequence can happen:
1. The read-only transaction mentioned above reads row X.
2. A separate, concurrent transaction updates row X.
3. The read/write transaction mentioned above writes the value computed in step 1, silently overwriting the update from step 2.
To fix this, I could read the value back during the read/write transaction and verify that it hasn't changed, similar to this example; however, that seems very slow (~50 ms per call, which translates to several weeks to update the whole table).
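For reference, a sketch of that read-back check, again assuming the Cloud Spanner Java client and the placeholder schema from the sketch above; the one-transaction-per-row pattern is what makes it slow:

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Key;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.Struct;
import java.util.Collections;

// Re-reads the row inside the read/write transaction and only writes if the
// value still matches what the outer read-only transaction saw. Running one
// such transaction per row is the ~50 ms-per-call bottleneck noted above.
static void updateIfUnchanged(DatabaseClient dbClient, long id, String expected, String newValue) {
  dbClient.readWriteTransaction().run(txn -> {
    Struct row = txn.readRow("Items", Key.of(id), Collections.singletonList("Value"));
    if (row != null && expected.equals(row.getString("Value"))) {
      txn.buffer(Mutation.newUpdateBuilder("Items")
          .set("Id").to(id)
          .set("Value").to(newValue)
          .build());
    }
    return null;
  });
}
```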
So how can I do this more efficiently? Thanks!
So you mean: run `executeQuery` in the r/w transaction, loop over the resultset, and create and buffer all mutations? That indeed seems to work, and is efficient! And as for pagination, I guess I would have to use the WHERE / ORDER BY / LIMIT technique described in this answer? – John
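Putting the comment's suggestion together, here is a hedged sketch (same assumed client and placeholder schema as above) that reads and buffers inside each read/write transaction, paginating by primary key with the WHERE / ORDER BY / LIMIT technique:

```java
import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.Mutation;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;

// Processes the table in pages of 10,000 rows. Each page is both read and
// updated inside a single read/write transaction, so the read and the write
// can no longer be interleaved by a concurrent writer. Paginates by resuming
// after the last primary key seen (assumes Id > Long.MIN_VALUE for all rows).
static void updateInPages(DatabaseClient dbClient) {
  long lastId = Long.MIN_VALUE;
  boolean more = true;
  while (more) {
    final long startAfter = lastId;
    Long maxId = dbClient.readWriteTransaction().run(txn -> {
      Long last = null;
      try (ResultSet rs = txn.executeQuery(
          Statement.newBuilder(
                  "SELECT Id, Value FROM Items WHERE Id > @lastId ORDER BY Id LIMIT 10000")
              .bind("lastId").to(startAfter)
              .build())) {
        while (rs.next()) {
          last = rs.getLong("Id");
          txn.buffer(Mutation.newUpdateBuilder("Items")
              .set("Id").to(last)
              .set("Value").to(transform(rs.getString("Value"))) // placeholder update logic
              .build());
        }
      }
      return last; // null means no rows left
    });
    if (maxId == null) {
      more = false;
    } else {
      lastId = maxId;
    }
  }
}
```

Because the query runs inside the read/write transaction, Spanner's locking should ensure a concurrent write to one of the page's rows either commits before the page is read or forces this transaction to abort and retry, so the lost-update sequence from steps 1-3 above cannot occur for rows within a page.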