(I started writing this a while ago but got stalled; I'm not sure it adds up to a full answer, but thought some of this might be worth posting. I think @LWimsey's comments do a better job of getting to the heart of an answer than what I wrote.)
Yes, it's safe.
Keep in mind that the way the as-if rule applies is that execution on the real machine has to always produce a result that matches one possible execution on the C++ abstract machine. It's legal for optimizations to make some executions that the C++ abstract machine allows impossible on the target. Even compiling for x86 at all makes all IRIW reordering impossible, for example, whether the compiler likes it or not. (See below; some PowerPC hardware is the only mainstream hardware that can do it in practice.)
I think the reason that wording is there for RMWs specifically is that it ties the load to the "modification order" which ISO C++ requires to exist for each atomic object separately. (Maybe.)
Remember that the way C++ formally defines its ordering model is in terms of synchronizes-with relationships and the existence of a modification order for each atomic object (which all threads must agree on). That's unlike hardware, where coherent caches¹ create a single coherent view of memory that every core accesses. The existence of coherent shared memory (typically using MESI to maintain coherence at all times) makes a bunch of things implicit, like the impossibility of reading "stale" values. (Although hardware memory models do typically document that explicitly, just as C++ does.)
Thus the transformation is safe.
ISO C++ does mention the concept of coherency in a note in another section: http://eel.is/c++draft/intro.races#14
> The value of an atomic object M, as determined by evaluation B, shall be the value stored by some side effect A that modifies M, where B does not happen before A.
>
> [Note 14: The set of such side effects is also restricted by the rest of the rules described here, and in particular, by the coherence requirements below. — end note]
>
> ...
>
> [Note 19: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note]
>
> [Note 20: The value observed by a load of an atomic depends on the “happens before” relation, which depends on the values observed by loads of atomics. The intended reading is that there must exist an association of atomic loads with modifications they observe that, together with suitably chosen modification orders and the “happens before” relation derived as described above, satisfy the resulting constraints as imposed here. — end note]
So ISO C++ itself notes that cache coherence gives some ordering, and x86 has coherent caches. (I'm not making a complete argument that this is enough ordering, sorry. LWimsey's comments about what it even means to be the latest in a modification order are relevant.)
(On many ISAs, but not all, the memory model also rules out IRIW reordering when you have stores to two separate objects: e.g. on PowerPC, two reader threads can disagree about the order of two stores to two separate objects. Very few implementations can create such reordering: if shared cache is the only way data can get between cores, as on most CPUs, that creates a single order for stores.)
> Is it possible for this transformation to cause the thread performing the RMW operation to instead load values older than the latest one?
On x86 specifically, it's very easy to reason about. x86 has a strongly-ordered memory model (TSO = Total Store Order = program order + a store buffer with store-forwarding).
Footnote 1: All cores that `std::thread` can run across have coherent caches. That's true on all real-world C++ implementations across all ISAs, not just x86-64. There are some heterogeneous boards with separate CPUs sharing memory without cache coherency, but ordinary C++ threads of the same process won't be running across those different cores. See this answer for more details about that.
(LWimsey's comments, illustrating how per-core store buffers are still compatible with a single modification order for `foo`:)

> For the sake of argument, the store buffer in A takes 10 min to flush, B takes 20 min and C 30 min. At time t=0, A performs store(1), B store(2) and C store(3). After that they continue to load values from `foo`. When t<10, all threads load `foo` from their own store buffer (A=1, B=2, C=3) ..... – LWimsey

> [At t=10, store buffer A is flushed; A now loads] `foo` from the L1 cache, but still gets 1. Threads B and C continue to perform a store buffer bypass and load values 2 and 3 respectively. At t=20, store buffer B is flushed, which causes A and B to load 2, while C still loads 3. At t=30, store buffer C is flushed and all threads load 3. From the beginning, A has observed 1,2,3. B has observed 2,3, while C has only observed 3. This is fully compliant with the concept of a single modification order. – LWimsey