> So my major question is how can `_Atomic_exchange_4(&_Guard, 0, memory_order_seq_cst);` create a full barrier `MFENCE`
This compiles to an `xchg` instruction with a memory destination. This is a full memory barrier (draining the store buffer) exactly¹ like `mfence`.

With compiler barriers before and after that, compile-time reordering around it is also prevented. Therefore all reordering in either direction is prevented (of operations on atomic and non-atomic C++ objects), making it more than strong enough to do everything that ISO C++ `atomic_thread_fence(mo_seq_cst)` promises.
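As a rough sketch (not MSVC's actual source; `g_dummy` and the function name are made up for illustration), the intrinsic's effect is equivalent to doing a `seq_cst` exchange on a dummy atomic object:

```cpp
#include <atomic>

// Hypothetical illustration only: a seq_cst RMW on a dummy object acts as a
// full fence on x86 because it compiles to `xchg` with a memory operand,
// which is a full barrier, with compiler barriers on both sides.
static std::atomic<int> g_dummy{0};   // made-up name; see below why a shared static is a bad choice

void full_fence_via_rmw() {
    g_dummy.exchange(0, std::memory_order_seq_cst);   // x86: xchg [g_dummy], reg
}
```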
For orders weaker than `seq_cst`, only a compiler barrier is needed. x86's hardware memory-ordering model is program-order + a store buffer with store forwarding. That's strong enough for `acq_rel` without the compiler emitting any special asm instructions, just blocking compile-time reordering. https://preshing.com/20120930/weak-vs-strong-memory-models/
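For example (a minimal sketch; the variable and function names are made up), release/acquire on x86 needs no fence instructions at all:

```cpp
#include <atomic>

std::atomic<int> ready{0};   // made-up names, for illustration
int payload;

void producer() {
    payload = 42;                               // plain store
    ready.store(1, std::memory_order_release);  // x86: plain `mov`; the ordering only
                                                // restricts the compiler, no fence emitted
}

int consumer() {
    while (ready.load(std::memory_order_acquire) == 0) {
    }                                           // x86: plain `mov` loads
    return payload;                             // guaranteed to read 42
}
```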
Footnote 1: exactly enough for the purposes of `std::atomic`. Weakly ordered `MOVNTDQA` loads from WC memory may not be as strictly ordered by `lock`ed instructions as by `MFENCE`.
Atomic read-modify-write (RMW) operations on x86 are only possible with a `lock` prefix, or with `xchg` with a memory operand, which behaves that way even without a `lock` prefix in the machine code. A `lock`-prefixed instruction (or `xchg` with memory) is always a full memory barrier.
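For instance (a sketch, not tied to any particular compiler's output), any atomic RMW gets the full-barrier behaviour on x86 no matter what memory order you ask for:

```cpp
#include <atomic>

std::atomic<int> counter{0};

void bump() {
    // On x86 this compiles to a `lock`-prefixed instruction
    // (typically `lock add` or `lock xadd`), which is a full memory
    // barrier even though only memory_order_relaxed was requested.
    counter.fetch_add(1, std::memory_order_relaxed);
}
```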
Using an instruction like `lock add dword [esp], 0` as a substitute for `mfence` is a well-known technique. (And performs better on some CPUs.) This MSVC code is the same idea, but instead of a no-op on whatever the stack pointer is pointing to, it does an `xchg` on a dummy variable. It doesn't actually matter where it is, but a cache line that's only ever accessed by the current core and is already hot in cache is the best choice for performance.
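In GNU C inline asm, that trick might look roughly like this (a sketch, assuming an x86-64 System V target where the red zone below `rsp` is safe to touch; the function name is made up):

```c
/* Sketch of the well-known `lock add` substitute for mfence.
 * A locked no-op RMW on a stack location the current core already
 * owns in its cache is a full barrier, and is cheaper than mfence
 * on some CPUs. */
static inline void full_barrier(void)
{
    __asm__ volatile("lock addl $0, -8(%%rsp)" ::: "memory", "cc");
}
```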
Using a `static` shared variable that all cores will contend for access to is the worst possible choice; this code is terrible! Interacting with the same cache line as other cores is not necessary to control the order of this core's operations on its own L1d cache. This is completely bonkers. MSVC still apparently uses this horrible code in its implementation of `std::atomic_thread_fence()`, even for x86-64 where `mfence` is guaranteed available. (Godbolt with MSVC 19.14)
If you're doing a `seq_cst` store, your options are `mov`+`mfence` (gcc does this) or doing the store and the barrier with a single `xchg` (clang and MSVC do this, so the codegen is fine, no shared dummy var).
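To illustrate (a sketch; the asm in the comments is typical codegen, not guaranteed for any specific compiler version):

```cpp
#include <atomic>

std::atomic<int> x{0};

void store_seq_cst(int v) {
    x.store(v, std::memory_order_seq_cst);
    // gcc (traditionally):  mov  [x], v
    //                       mfence
    // clang / MSVC:         xchg [x], reg   ; store + full barrier in one instruction,
    //                                       ; no dummy variable involved
}
```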
Much of the early part of this question (stating "facts") seems wrong and contains some misinterpretations or things that are so misguided they're not even wrong.
> `std::memory_order_seq_cst` makes no guarantee to prevent STORE-LOAD reorder.
C++ guarantees order using a totally different model, where acquire loads that see a value from a release store "synchronize with" it, and operations after the acquire load in the C++ source are guaranteed to see all the stores from code before the release store.
It also guarantees that there's a total order of all seq_cst operations even across different objects. (Weaker orders allow a thread to reload its own stores before they become globally visible, i.e. store forwarding. That's why only seq_cst has to drain the store buffer. They also allow IRIW reordering. Will two atomic writes to different locations in different threads always be seen in the same order by other threads?)
Concepts like StoreLoad reordering are based on a model where:
- All inter-core communication is via committing stores to cache-coherent shared memory
- Reordering happens inside one core between its own accesses to cache. e.g. by the store buffer delaying store visibility until after later loads like x86 allows. (Except a core can see its own stores early via store forwarding.)
In terms of this model, seq_cst does require draining the store buffer at some point between a seq_cst store and a later seq_cst load. The efficient way to implement this is to put a full barrier after seq_cst stores. (Instead of before every seq_cst load. Cheap loads are more important than cheap stores.)
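The classic store-buffer litmus test shows what `seq_cst` rules out here (a sketch; run each function in its own thread):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};

// With seq_cst everywhere, the outcome r1 == 0 && r2 == 0 is impossible:
// each thread's store must leave the store buffer before its later load.
// With release stores / acquire loads (or relaxed), both loads may execute
// while both stores still sit in their store buffers, so 0/0 is allowed.
void thread1(int& r1) {
    x.store(1, std::memory_order_seq_cst);
    r1 = y.load(std::memory_order_seq_cst);
}

void thread2(int& r2) {
    y.store(1, std::memory_order_seq_cst);
    r2 = x.load(std::memory_order_seq_cst);
}
```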
On an ISA like AArch64, there are load-acquire and store-release instructions which actually have sequential-release / sequential-acquire semantics, unlike x86 loads/stores which are "only" regular acquire / release. (So AArch64 `seq_cst` doesn't need a separate barrier; a microarchitecture could delay draining the store buffer unless / until a load-acquire executes while there's still a store-release not committed to L1d cache yet.) Other ISAs generally need a full barrier instruction to drain the store buffer after a `seq_cst` store.
Of course even AArch64 needs a full barrier instruction for a `seq_cst` fence, unlike a `seq_cst` load or store operation.
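For comparison, the usual AArch64 mapping looks something like this (a sketch of typical codegen, not a guarantee for any particular compiler):

```cpp
#include <atomic>

std::atomic<int> a{0};

void aarch64_mapping() {
    a.store(1, std::memory_order_seq_cst);                // AArch64: stlr    (store-release)
    (void)a.load(std::memory_order_seq_cst);              // AArch64: ldar    (load-acquire)
    std::atomic_thread_fence(std::memory_order_seq_cst);  // AArch64: dmb ish (full barrier)
}
```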
> `std::atomic_thread_fence(memory_order_seq_cst)` always generates a full-barrier
In practice yes.
> So I can always replace `asm volatile("mfence" ::: "memory")` with `std::atomic_thread_fence(memory_order_seq_cst)`
In practice yes, but in theory an implementation could maybe allow some reordering of non-atomic operations around `std::atomic_thread_fence` and still be standards-compliant. *Always* is a very strong word.
ISO C++ only guarantees anything when there are `std::atomic` load or store operations involved. GNU C++ would let you roll your own atomic operations out of `asm("" ::: "memory")` compiler barriers (acq_rel) and `asm("mfence" ::: "memory")` full barriers. Converting that to ISO C++ `signal_fence` and `thread_fence` would leave a "portable" ISO C++ program that has data-race UB and thus no guarantee of anything.

(Although note that rolling your own atomics should use at least `volatile`, not just barriers, to make sure the compiler doesn't invent multiple loads, even if you avoid the obvious problem of having loads hoisted out of a loop. Who's afraid of a big bad optimizing compiler?)
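A rough sketch of what "rolling your own" means here (GNU C only, only sound on a strongly ordered target like x86, and the function names are made up):

```c
/* Hand-rolled acquire load / release store, relying on x86's hardware
 * ordering plus GNU C semantics -- NOT on any ISO C++ guarantee.
 * volatile keeps the compiler from inventing, merging, or hoisting the
 * accesses; the empty asm statements block compile-time reordering. */
static inline int my_load_acquire(const volatile int *p)
{
    int v = *p;                         /* exactly one load */
    __asm__ volatile("" ::: "memory");  /* later code stays after the load */
    return v;
}

static inline void my_store_release(volatile int *p, int v)
{
    __asm__ volatile("" ::: "memory");  /* earlier code stays before the store */
    *p = v;                             /* exactly one store */
}
```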
Always remember that what an implementation does has to be at least as strong as what ISO C++ guarantees. That often ends up being stronger.
Comments:

- "… `atomic_store(memory_order_seq_cst)` and `atomic_load(memory_order_seq_cst)`, there'll be no reorder. However if I use `atomic_store(memory_order_release)` and `atomic_load(memory_order_acquire)`, then I should add an `MFENCE` to either of them, in order to avoid STORE-LOAD reorder?" – calvin
- "… `seq_cst` on both the `store` and the `load`, all threads will observe both operations in that order. The same for inserting an `atomic_thread_fence(seq_cst)` in between (you can/should not really insert an `MFENCE`, leave that to the compiler)." – LWimsey
- "… `x.store(1, memory_order_release); x.load(memory_order_acquire);` then no fence would be needed (although such a construct would be highly questionable, so you probably meant them to be on different memory locations)." – Carlo Wood