Intel x86/x86_64 systems have three types of memory barriers: LFENCE, SFENCE and MFENCE. The question is about their use.
For sequential consistency (SC) it is sufficient to use MOV [addr], reg + MFENCE for all memory cells requiring SC semantics. However, you could also write it the other way around: MFENCE + MOV reg, [addr]. Apparently it was decided that, since the number of stores to memory is usually smaller than the number of loads from it, putting the barrier on the store side is cheaper in total. And on this basis, given that the stores carry the barrier, another optimization was made: [LOCK] XCHG, which is probably cheaper because the "MFENCE inside XCHG" applies only to the cache line used by the XCHG (see the video, where at 0:28:20 it is said that MFENCE is more expensive than XCHG).
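A minimal C++11 sketch of the two store-side options (function names are mine; exact codegen depends on the compiler: GCC 4.8 emits MOV + MFENCE for the plain seq_cst store, while the XCHG form comes from an explicit exchange, or from some newer compilers even for a plain store):

#include <atomic>

std::atomic<int> a{0};

// Option 1: plain store + full fence. GCC 4.8 compiles this line to:
//   mov %eax, a(%rip)
//   mfence
void store_sc(int v) {
    a.store(v, std::memory_order_seq_cst);
}

// Option 2: implicitly LOCKed read-modify-write, compiling to:
//   xchg %eax, a(%rip)
void store_sc_xchg(int v) {
    a.exchange(v, std::memory_order_seq_cst);
}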
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
C/C++11 Operation -> x86 implementation:
- Load Seq_Cst: MOV (from memory)
- Store Seq_Cst: (LOCK) XCHG // alternative: MOV (into memory), MFENCE

Note: there is an alternative mapping of C/C++11 to x86, which instead of locking (or fencing) the Seq_Cst store locks/fences the Seq_Cst load:
- Load Seq_Cst: LOCK XADD(0) // alternative: MFENCE, MOV (from memory)
- Store Seq_Cst: MOV (into memory)
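The alternative mapping can be hand-rolled in C++11 roughly as follows (a sketch, names mine; no compiler generates this for seq_cst operations by default, and the two mappings must not be mixed on the same atomic within one program):

#include <atomic>

std::atomic<int> flag{0};

// SC store as a plain MOV: a release store compiles to a plain mov on x86.
void store_sc_alt(int v) {
    flag.store(v, std::memory_order_release);            // MOV (into memory)
}

// The SC load pays for the ordering instead: a seq_cst fence typically
// compiles to MFENCE on x86, and the acquire load to a plain mov.
int load_sc_alt() {
    std::atomic_thread_fence(std::memory_order_seq_cst); // MFENCE
    return flag.load(std::memory_order_acquire);         // MOV (from memory)
}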
The difference is that ARM and Power memory barriers interact exclusively with the LLC (Last Level Cache), while x86 barriers also interact with the lower-level caches L1/L2. In x86/x86_64:
LFENCE on Core1: (CoreX-L1) -> (CoreX-L2) -> L3 -> (Core1-L2) -> (Core1-L1)
SFENCE on Core1: (Core1-L1) -> (Core1-L2) -> L3 -> (CoreX-L2) -> (CoreX-L1)
In ARM:
ldr; dmb;: L3 -> (Core1-L2) -> (Core1-L1)
dmb; str; dmb;: (Core1-L1) -> (Core1-L2) -> L3
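A hand-rolled C++11 sketch of that ARM mapping (names mine): on ARMv7 a seq_cst fence typically compiles to dmb, and relaxed accesses to plain ldr/str, reproducing the sequences above:

#include <atomic>

int load_sc_arm(std::atomic<int>& x) {
    int v = x.load(std::memory_order_relaxed);           // ldr
    std::atomic_thread_fence(std::memory_order_seq_cst); // dmb
    return v;
}

void store_sc_arm(std::atomic<int>& x, int v) {
    std::atomic_thread_fence(std::memory_order_seq_cst); // dmb
    x.store(v, std::memory_order_relaxed);               // str
    std::atomic_thread_fence(std::memory_order_seq_cst); // dmb
}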
C++11 code compiled by GCC 4.8.2, disassembled with GDB on x86_64:
std::atomic<int> a;
int temp = 0;
a.store(temp, std::memory_order_seq_cst);
0x4613e8 <+0x0058> mov 0x38(%rsp),%eax   # load temp
0x4613ec <+0x005c> mov %eax,0x20(%rsp)   # plain store into a
0x4613f0 <+0x0060> mfence                # full barrier after the store
But why does x86/x86_64 implement sequential consistency (SC) with MOV [addr], reg + MFENCE rather than MOV [addr], reg + SFENCE? Why do we need the full fence MFENCE there instead of SFENCE?
From the comments:

"SFENCE cannot provide a total ordering that's observed by all CPUs" - i.e., why do we need the LFENCE part of MFENCE after each store operation (and not before load operations)? - Alex

Suppose X and Y are zero. Now: [Thread 1: STORE X = 1, SFENCE], [Thread 2: STORE Y = 1, SFENCE], and in any other thread do [LFENCE, LOAD X, LOAD Y]. Now one other thread could see X = 1, Y = 0, and another could see X = 0, Y = 1. The fences only tell you that other, earlier stores in Thread 1 have taken effect if you see X = 1. But there's no global order consistent with that. - Kerrek SB
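Kerrek SB's example is the classic IRIW (independent readers of independent writers) litmus test. A runnable C++11 sketch of it (thread structure and names are mine): with seq_cst on every access, the split outcome described in the comment is forbidden, because all SC operations fall into a single total order, which is exactly what a store-side SFENCE alone cannot guarantee.

#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> X{0}, Y{0};

int main() {
    int r1, r2, r3, r4;
    std::thread t1([&] { X.store(1, std::memory_order_seq_cst); });
    std::thread t2([&] { Y.store(1, std::memory_order_seq_cst); });
    std::thread t3([&] { r1 = X.load(std::memory_order_seq_cst);
                         r2 = Y.load(std::memory_order_seq_cst); });
    std::thread t4([&] { r3 = Y.load(std::memory_order_seq_cst);
                         r4 = X.load(std::memory_order_seq_cst); });
    t1.join(); t2.join(); t3.join(); t4.join();
    // Forbidden under SC: r1==1 && r2==0 && r3==1 && r4==0,
    // i.e. the two readers disagreeing on the order of the two stores.
    std::printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
}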