Optimistic locking and re-try

Question

I'm not sure about proper design of an approach.

We use optimistic locking using long incremented version placed on every entity. Each update of such entity is executed via compare-and-swap algorithm which just succeed or fail depending on whether some other client updates entity in the meantime or not. Classic optimistic locking as e.g. hibernate do.

We also need to adopt re-trying approach. We use http based storage (etcd) and it can happen that some update request is just timeouted.

And here it's the problem. How to combine optimistic locking and re-try. Here is the specific issue I'm facing.

Let say I have an entity having version=1 and I'm trying to update it. Next version is obviously 2. My client than executes conditional update. It's successfully executed only when the version in persistence is 1 and it's atomically updated to version=2. So far, so good.

Now, let say that a response for the update request does not arrive. It's impossible to say if it succeeded or not at this moment. The only thing I can do now is to re-try the update again. In memory entity still contains version=1 intending to update value to 2.

The real problem arise now. What if the second update fails because a version in persistence is 2 and not 1?

There is two possible reasons:

first request indeed caused the update - the operation was successful but the response got lost or my client timeout, whatever. It just did not arrived but it passed
some other client performed the update concurrently on the background

Now I can't say what is true. Did my client update the entity or some other client did? Did the operation passed or failed?

Current approach we use just compares persisted entity and the entity in main memory. Either as java equal or json content equality. If they are equal, the update methods is declared as successful. I'm not satisfied with the algorithm as it's not both cheap and reasonable for me.

Another possible approach is to do not use long version but timestamp instead. Every client generates own timestamp within the update operation in the meaning that potential concurrent client would generate other in high probability. The problem for me is the probability, especially when two concurrent updates would come from same machine.

Is there any other solution?

Matthias Urlichs Matthias Urlichs · Accepted Answer · 2015-11-26T15:03:57

You can fake transactions in etcd by using a two-step protocol.

Algorithm for updating:

First phase: record the update to etcd

add an "update-lock" node with a fairly small TTL. If it exists, wait until it disappears and try again.
add a watchdog to your code. You MUST abort if performing the next steps takes longer than the lock's TTL (or if you fail to refresh it).
add a "update-plan" node with [old,new] values. Its structure is up to you, but you need to ensure that the old values are copied while you hold the lock.
add a "committed-update" node. At this point you have "atomically" updated the data.

Second phase: perform the actual update

read the "planned-update" node and apply the changes it describes.
- If a change fails, verify that the new value is present.
- If it's not, you have a major problem. Bail out.
delete the committed-update node
delete the update-plan node
delete the update-lock node

If you want to read consistent data:

While there is no committed-update node, your data are OK.
Otherwise, wait for it to get deleted.
- Whenever committed-update is present but update-lock is not, initiate recovery.

Transaction recovery, if you find an update-plan node without a lock:

Get the update-lock.
if there is no committed-update node, delete the plan and release the lock.
Otherwise, continue at "Second phase", above.

Optimistic locking and re-try

2 Answers