Using an atomic read-modify-write operation in a release sequence

Question

Say, I create an object of type Foo in thread #1 and want to be able to access it in thread #3.
I can try something like:

std::atomic<int> sync{10};
Foo *fp;

// thread 1: modifies sync: 10 -> 11
fp = new Foo;
sync.store(11, std::memory_order_release);

// thread 2a: modifies sync: 11 -> 12
while (sync.load(std::memory_order_relaxed) != 11);
sync.store(12, std::memory_order_relaxed);

// thread 3
while (sync.load(std::memory_order_acquire) != 12);
fp->do_something();

- The store/release in thread #1 orders Foo with the update to 11.
- Thread #2a advances the value of sync to 12 with a separate relaxed load and relaxed store, i.e. not as a single atomic read-modify-write.
- The synchronizes-with relationship between thread #1 and #3 is only established when #3 loads 11.

The scenario is broken because thread #3 spins until it loads 12, which may arrive out of order (wrt 11) and Foo is not ordered with 12 (due to the relaxed operations in thread #2a).
This is somewhat counter-intuitive since the modification order of sync is 10 → 11 → 12

The standard says (§ 1.10.1-6):

an atomic store-release synchronizes with a load-acquire that takes its value from the store (29.3). [ Note: Except in the specified cases, reading a later value does not necessarily ensure visibility as described below. Such a requirement would sometimes interfere with efficient implementation. —end note ]

It also says in (§ 1.10.1-5):

A release sequence headed by a release operation A on an atomic object M is a maximal contiguous subsequence of side effects in the modification order of M, where the first operation is A, and every subsequent operation
- is performed by the same thread that performed A, or
- is an atomic read-modify-write operation.

Now, thread #2a is modified to use an atomic read-modify-write operation:

// thread 2b: modifies sync: 11 -> 12
int val;
while ((val = 11) && !sync.compare_exchange_weak(val, 12, std::memory_order_relaxed));

If this release sequence is correct, Foo is synchronized with thread #3 when it loads either 11 or 12. My questions about the use of an atomic read-modify-write are:

Does the scenario with thread #2b constitute a correct release sequence ?

And if so:

What are the specific properties of a read-modify-write operation that ensure this scenario is correct ?

Do you have any particular reason to doubt that store(11) and compare_exchange(11, 12) constitute a release sequence? They satisfy all the requirements in the paragraph you quoted. — Anton
@user3290797 Well, maybe because I have seen these chains before with RMW's at the end, but never in the middle. You are right, it should be correct per the standard. I guess it is more about the follow-up questions. — LWimsey
@PeterCordes My wording was a bit clumsy, but I agree: thread #3 may never see 11, and that applies to both scenarios 2a and 2b. But in the 2a case, Foo only becomes (reliably) visible when (and if) thread #3 loads 11. If it loads 12, it has become impossible to access Foo because it is unordered wrt 12, and 11 is 'lost' (I referred to that scenario in the question as 'broken') — LWimsey
Oh right, I lost track of the big picture. Yes, I think in 2b, the RMW preserves causality, because it can't make 12 globally visible before 11 was. So seeing 12 means that Foo is ready. A separate store doesn't have this property in C++11. — Peter Cordes
In asm for real hardware, I think it's usually safe to atomically load, then atomically store something that has a data dependency on the load. (But value-prediction for loads is a theoretical possibility that would break this the same way a speculative control dependency does.) C++ rules are conservative here, and disallow anything but atomic-RMW propagating a dependency. — Peter Cordes

BeeOnRope · Accepted Answer · 2017-09-01T21:41:16

Does the scenario with thread #2b constitute a correct release sequence ?

Yes, per your quote from the standard.
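
As a minimal sketch of the scenario described in the question, threads #1, #2b and #3 could be assembled into one function as follows. The Foo definition with its payload field, the function name run_release_sequence_demo and the yield calls are assumptions added for illustration; they are not part of the question.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the question's Foo; the payload field is an assumption.
struct Foo {
    int payload = 0;
    int do_something() const { return payload; }
};

// Runs threads #1, #2b and #3 once and returns what thread #3 read through fp.
int run_release_sequence_demo() {
    std::atomic<int> sync{10};
    Foo* fp = nullptr;  // plain, non-atomic pointer, as in the question
    int seen = -1;

    std::thread t3([&] {
        // Acquire load that spins until it reads 12. The value 12 is written by
        // a read-modify-write that belongs to the release sequence headed by
        // thread #1's store of 11, so this load synchronizes with that store.
        while (sync.load(std::memory_order_acquire) != 12) {
            std::this_thread::yield();
        }
        seen = fp->do_something();
    });

    std::thread t2([&] {
        // Relaxed RMW: 11 -> 12. On failure, compare_exchange_weak overwrites
        // `expected` with the value it observed, so it is reset before retrying.
        int expected = 11;
        while (!sync.compare_exchange_weak(expected, 12, std::memory_order_relaxed)) {
            expected = 11;
            std::this_thread::yield();
        }
    });

    std::thread t1([&] {
        fp = new Foo;
        fp->payload = 42;
        sync.store(11, std::memory_order_release);  // heads the release sequence
    });

    t1.join();
    t2.join();
    t3.join();

    assert(sync.load() == 12);
    delete fp;
    return seen;
}
```

Because the only modification of sync after the store of 11 is the compare-exchange, the store and the RMW form one release sequence, and thread #3 is guaranteed to see the fully constructed Foo whether it would have read 11 or 12.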

What are the specific properties of a read-modify-write operation that ensure this scenario is correct?

Well, the somewhat circular answer is that the only important specific property is that "The C++ standard defines it so".

As a practical matter, one may ask why the standard defines it like this. I don't think you'll find that the answer has a deep theoretical basis: I think the committee could have also defined it such that the RMW doesn't participate in the release sequence, or (perhaps with more difficulty) have defined it so that both the RMW and the separate mo_relaxed load and store participate in the release sequence, without compromising the "soundness" of the model.

They already give a performance-related reason as to why they didn't choose the latter approach:

Such a requirement would sometimes interfere with efficient implementation.

In particular, on any hardware platform that allowed load-store reordering, it would imply that even mo_relaxed loads and/or stores might require barriers! Such platforms exist today. Even on more strongly ordered platforms, it may inhibit compiler optimizations.

So why didn't they then take the other "consistent" approach of not requiring mo_relaxed RMW operations to participate in the release sequence? Probably because existing hardware implementations of RMW operations provide such guarantees, and the nature of RMW operations makes it likely that this will remain true in the future. In particular, as Peter points out in the comments above, RMW operations, even with mo_relaxed, are conceptually and practically¹ stronger than separate loads and stores: they would be quite useless if they didn't have a consistent total order.

Once you accept that this is how hardware works, it makes sense from a performance angle to align the standard with it: if you didn't, you'd have people using more restrictive orderings such as mo_acq_rel just to get the release sequence guarantees, but on real hardware that has weakly ordered CAS, this doesn't come for free.
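
To make the trade-off concrete, here is a small sketch of the middle thread's step with the memory ordering left as a parameter. The helper name advance_11_to_12 is hypothetical and only serves the illustration.

```cpp
#include <atomic>

// Hypothetical helper: advances sync from 11 to 12 with a caller-chosen ordering.
// It spins until the exchange succeeds, so it assumes sync eventually holds 11.
inline void advance_11_to_12(std::atomic<int>& sync, std::memory_order order) {
    int expected = 11;
    while (!sync.compare_exchange_weak(expected, 12, order)) {
        expected = 11;  // a failed exchange stores the observed value in `expected`
    }
}
```

Calling it with std::memory_order_relaxed is enough to keep the operation in the release sequence, because it is a read-modify-write. Calling it with std::memory_order_acq_rel would also work, but on weakly ordered ISAs that typically costs additional ordering instructions, which is the overhead the standard's rule lets you avoid.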

¹ The "practically" part means that even the weakest forms of RMW instructions are usually relatively "expensive" operations taking a dozen cycles or more on modern hardware, while mo_relaxed loads and stores generally just compile to plain loads and stores in the target ISA.
