> So my major question is how can `_Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst);` create a full barrier `MFENCE`
This compiles to an `xchg` instruction with a memory destination. This is a full memory barrier (draining the store buffer) exactly¹ like `mfence`.

With compiler barriers before and after that, compile-time reordering around it is also prevented. Therefore all reordering in either direction is prevented (of operations on atomic and non-atomic C++ objects), making it more than strong enough to do everything that ISO C++ `atomic_thread_fence(mo_seq_cst)` promises.
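As a rough sketch (not MSVC's actual source; `g_dummy` and the function name are made up for illustration), the intrinsic's effect is equivalent to doing a `seq_cst` exchange on a dummy atomic object:

```cpp
#include <atomic>

// Hypothetical illustration only: a seq_cst RMW on a dummy object acts as a
// full fence on x86 because it compiles to `xchg` with a memory operand,
// which is a full barrier, with compiler barriers on both sides.
static std::atomic<int> g_dummy{0};   // made-up name; see below why a shared static is a bad choice

void full_fence_via_rmw() {
    g_dummy.exchange(0, std::memory_order_seq_cst);   // x86: xchg [g_dummy], reg
}
```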
For orders weaker than `seq_cst`, only a compiler barrier is needed. x86's hardware memory-ordering model is program-order + a store buffer with store forwarding. That's strong enough for `acq_rel` without the compiler emitting any special asm instructions, just blocking compile-time reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/
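For example (a minimal sketch; the variable and function names are made up), release/acquire on x86 needs no fence instructions at all:

```cpp
#include <atomic>

std::atomic<int> ready{0};   // made-up names, for illustration
int payload;

void producer() {
    payload = 42;                               // plain store
    ready.store(1, std::memory_order_release);  // x86: plain `mov`; the ordering only
                                                // restricts the compiler, no fence emitted
}

int consumer() {
    while (ready.load(std::memory_order_acquire) == 0) {
    }                                           // x86: plain `mov` loads
    return payload;                             // guaranteed to read 42
}
```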
Footnote 1: exactly enough for the purposes of `std::atomic`. Weakly ordered `MOVNTDQA` loads from WC memory may not be as strictly ordered by `lock`ed instructions as by `MFENCE`.
Atomic read-modify-write (RMW) operations on x86 are only possible with a `lock` prefix, or with `xchg` with a memory operand, which behaves that way even without a `lock` prefix in the machine code. A `lock`-prefixed instruction (or `xchg` with memory) is always a full memory barrier.
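For instance (a sketch, not tied to any particular compiler's output), any atomic RMW gets the full-barrier behaviour on x86 no matter what memory order you ask for:

```cpp
#include <atomic>

std::atomic<int> counter{0};

void bump() {
    // On x86 this compiles to a `lock`-prefixed instruction
    // (typically `lock add` or `lock xadd`), which is a full memory
    // barrier even though only memory_order_relaxed was requested.
    counter.fetch_add(1, std::memory_order_relaxed);
}
```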
Using an instruction like `lock add dword [esp], 0` as a substitute for `mfence` is a well-known technique. (And performs better on some CPUs.) This MSVC code is the same idea, but instead of a no-op on whatever the stack pointer is pointing to, it does an `xchg` on a dummy variable. It doesn't actually matter where it is, but a cache line that's only ever accessed by the current core and is already hot in cache is the best choice for performance.
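In GNU C inline asm, that trick might look roughly like this (a sketch, assuming an x86-64 System V target where the red zone below `rsp` is safe to touch; the function name is made up):

```c
/* Sketch of the well-known `lock add` substitute for mfence.
 * A locked no-op RMW on a stack location the current core already
 * owns in its cache is a full barrier, and is cheaper than mfence
 * on some CPUs. */
static inline void full_barrier(void)
{
    __asm__ volatile("lock addl $0, -8(%%rsp)" ::: "memory", "cc");
}
```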
Using a `static` shared variable that all cores will contend for access to is the worst possible choice; this code is terrible! Interacting with the same cache line as other cores is not necessary to control the order of this core's operations on its own L1d cache. This is completely bonkers. MSVC still apparently uses this horrible code in its implementation of `std::atomic_thread_fence()`, even for x86-64 where `mfence` is guaranteed available. (Godbolt with MSVC 19.14)
If you're doing a `seq_cst` store, your options are `mov`+`mfence` (gcc does this) or doing the store and the barrier with a single `xchg` (clang and MSVC do this, so the codegen is fine, no shared dummy var).
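To illustrate (a sketch; the asm in the comments is typical codegen, not guaranteed for any specific compiler version):

```cpp
#include <atomic>

std::atomic<int> x{0};

void store_seq_cst(int v) {
    x.store(v, std::memory_order_seq_cst);
    // gcc (traditionally):  mov  [x], v
    //                       mfence
    // clang / MSVC:         xchg [x], reg   ; store + full barrier in one instruction,
    //                                       ; no dummy variable involved
}
```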
Much of the early part of this question (stating "facts") seems wrong and contains some misinterpretations or things that are so misguided they're not even wrong.
> `std::memory_order_seq_cst` makes no guarantee to prevent STORE-LOAD reorder.
C++ guarantees order using a totally different model, where acquire loads that see a value from a release store "synchronize with" it, and operations after the acquire load in the C++ source are guaranteed to see all the stores from code before the release store.
It also guarantees that there's a total order of all seq_cst operations even across different objects. (Weaker orders allow a thread to reload its own stores before they become globally visible, i.e. store forwarding. That's why only seq_cst has to drain the store buffer. They also allow IRIW reordering. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
Concepts like StoreLoad reordering are based on a model where:
- All inter-core communication is via committing stores to cache-coherent shared memory
- Reordering happens inside one core between its own accesses to cache. e.g. by the store buffer delaying store visibility until after later loads like x86 allows. (Except a core can see its own stores early via store forwarding.)
In terms of this model, seq_cst does require draining the store buffer at some point between a seq_cst store and a later seq_cst load. The efficient way to implement this is to put a full barrier after seq_cst stores. (Instead of before every seq_cst load. Cheap loads are more important than cheap stores.)
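The classic store-buffer litmus test shows what `seq_cst` rules out here (a sketch; run each function in its own thread):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};

// With seq_cst everywhere, the outcome r1 == 0 && r2 == 0 is impossible:
// each thread's store must leave the store buffer before its later load.
// With release stores / acquire loads (or relaxed), both loads may execute
// while both stores still sit in their store buffers, so 0/0 is allowed.
void thread1(int& r1) {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

void thread2(int& r2) {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}
```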
On an ISA like AArch64, there are load-acquire and store-release instructions which actually have sequential-release / sequential-acquire semantics, unlike x86 loads/stores which are "only" regular acquire / release. (So AArch64 `seq_cst` doesn't need a separate barrier; a microarchitecture could delay draining the store buffer unless / until a load-acquire executes while there's still a store-release not committed to L1d cache yet.) Other ISAs generally need a full barrier instruction to drain the store buffer after a `seq_cst` store.
Of course even AArch64 needs a full barrier instruction for a `seq_cst` fence, unlike a `seq_cst` load or store operation.
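For comparison, the usual AArch64 mapping looks something like this (a sketch of typical codegen, not a guarantee for any particular compiler):

```cpp
#include <atomic>

std::atomic<int> a{0};

void aarch64_mapping() {
    a.store(1, std::memory_order_seq_cst);                // AArch64: stlr    (store-release)
    (void)a.load(std::memory_order_seq_cst);              // AArch64: ldar    (load-acquire)
    std::atomic_thread_fence(std::memory_order_seq_cst);  // AArch64: dmb ish (full barrier)
}
```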
> `std::atomic_thread_fence(memory_order_seq_cst)` always generates a full-barrier
In practice yes.
> So I can always replace `asm volatile("mfence" ::: "memory")` with `std::atomic_thread_fence(memory_order_seq_cst)`
In practice yes, but in theory an implementation could maybe allow some reordering of non-atomic operations around `std::atomic_thread_fence` and still be standards-compliant. *Always* is a very strong word.
ISO C++ only guarantees anything when there are `std::atomic` load or store operations involved. GNU C++ would let you roll your own atomic operations out of `asm("" ::: "memory")` compiler barriers (acq_rel) and `asm("mfence" ::: "memory")` full barriers. Converting that to ISO C++ `signal_fence` and `thread_fence` would leave a "portable" ISO C++ program that has data-race UB and thus no guarantee of anything.

(Although note that rolling your own atomics should use at least `volatile`, not just barriers, to make sure the compiler doesn't invent multiple loads, even if you avoid the obvious problem of having loads hoisted out of a loop. Who's afraid of a big bad optimizing compiler?)
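A rough sketch of what "rolling your own" means here (GNU C only, only sound on a strongly ordered target like x86, and the function names are made up):

```c
/* Hand-rolled acquire load / release store, relying on x86's hardware
 * ordering plus GNU C semantics -- NOT on any ISO C++ guarantee.
 * volatile keeps the compiler from inventing, merging, or hoisting the
 * accesses; the empty asm statements block compile-time reordering. */
static inline int my_load_acquire(const volatile int *p)
{
    int v = *p;                         /* exactly one load */
    __asm__ volatile("" ::: "memory");  /* later code stays after the load */
    return v;
}

static inline void my_store_release(volatile int *p, int v)
{
    __asm__ volatile("" ::: "memory");  /* earlier code stays before the store */
    *p = v;                             /* exactly one store */
}
```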
Always remember that what an implementation does has to be at least as strong as what ISO C++ guarantees. That often ends up being stronger.
Comments:

- "… `atomic_store(memory_order_seq_cst)` and `atomic_load(memory_order_seq_cst)`, there'll be no reorder. However if I use `atomic_store(memory_order_release)` and `atomic_load(memory_order_acquire)`, then I should add an `MFENCE` to either of them, in order to avoid STORE-LOAD reorder?" – calvin
- "… `seq_cst` on both the `store` and the `load`, all threads will observe both operations in that order. The same for inserting an `atomic_thread_fence(seq_cst)` in between (you can/should not really insert an `MFENCE`, leave that to the compiler)." – LWimsey
- "… `x.store(1, memory_order_release); x.load(memory_order_acquire);` then no fence would be needed (although such a construct would be highly questionable, so you probably meant them to be on different memory locations)." – Carlo Wood