Intel x86/x86_64 systems have three types of memory barriers: LFENCE, SFENCE and MFENCE. The question is about their use.
For sequential consistency (SC) it is sufficient to use MOV [addr], reg + MFENCE for all memory cells requiring SC semantics. However, you could also write it the other way around: MFENCE + MOV reg, [addr]. Apparently it was decided that, since the number of stores to memory is usually smaller than the number of loads from it, putting the barrier on the store side is cheaper in total. And on this basis, given that the stores carry the barrier, another optimization was made: [LOCK] XCHG, which is probably cheaper because the "MFENCE inside XCHG" applies only to the cache line used by the XCHG (see the video, where at 0:28:20 it is said that MFENCE is more expensive than XCHG).
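A minimal C++11 sketch of the two store-side options (function names are mine; exact codegen depends on the compiler: GCC 4.8 emits MOV + MFENCE for the plain seq_cst store, while the XCHG form comes from an explicit exchange, or from some newer compilers even for a plain store):

#include <atomic>

std::atomic<int> a{0};

// Option 1: plain store + full fence. GCC 4.8 compiles this line to:
//   mov %eax, a(%rip)
//   mfence
void store_sc(int v) {
    a.store(v, std::memory_order_seq_cst);
}

// Option 2: implicitly LOCKed read-modify-write, compiling to:
//   xchg %eax, a(%rip)
void store_sc_xchg(int v) {
    a.exchange(v, std::memory_order_seq_cst);
}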
http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html
C/C++11 Operation -> x86 implementation:
- Load Seq_Cst: MOV (from memory)
- Store Seq_Cst: (LOCK) XCHG // alternative: MOV (into memory), MFENCE

Note: there is an alternative mapping of C/C++11 to x86, which instead of locking (or fencing) the Seq_Cst store locks/fences the Seq_Cst load:
- Load Seq_Cst: LOCK XADD(0) // alternative: MFENCE, MOV (from memory)
- Store Seq_Cst: MOV (into memory)
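The alternative mapping can be hand-rolled in C++11 roughly as follows (a sketch, names mine; no compiler generates this for seq_cst operations by default, and the two mappings must not be mixed on the same atomic within one program):

#include <atomic>

std::atomic<int> flag{0};

// SC store as a plain MOV: a release store compiles to a plain mov on x86.
void store_sc_alt(int v) {
    flag.store(v, std::memory_order_release);            // MOV (into memory)
}

// The SC load pays for the ordering instead: a seq_cst fence typically
// compiles to MFENCE on x86, and the acquire load to a plain mov.
int load_sc_alt() {
    std::atomic_thread_fence(std::memory_order_seq_cst); // MFENCE
    return flag.load(std::memory_order_acquire);         // MOV (from memory)
}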
The difference is that ARM and Power memory barriers interact exclusively with the LLC (Last Level Cache), while x86 barriers also interact with the lower-level caches L1/L2. In x86/x86_64:
LFENCE on Core1: (CoreX-L1) -> (CoreX-L2) -> L3 -> (Core1-L2) -> (Core1-L1)
SFENCE on Core1: (Core1-L1) -> (Core1-L2) -> L3 -> (CoreX-L2) -> (CoreX-L1)
In ARM:
ldr; dmb;: L3 -> (Core1-L2) -> (Core1-L1)
dmb; str; dmb;: (Core1-L1) -> (Core1-L2) -> L3
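A hand-rolled C++11 sketch of that ARM mapping (names mine): on ARMv7 a seq_cst fence typically compiles to dmb, and relaxed accesses to plain ldr/str, reproducing the sequences above:

#include <atomic>

int load_sc_arm(std::atomic<int>& x) {
    int v = x.load(std::memory_order_relaxed);           // ldr
    std::atomic_thread_fence(std::memory_order_seq_cst); // dmb
    return v;
}

void store_sc_arm(std::atomic<int>& x, int v) {
    std::atomic_thread_fence(std::memory_order_seq_cst); // dmb
    x.store(v, std::memory_order_relaxed);               // str
    std::atomic_thread_fence(std::memory_order_seq_cst); // dmb
}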
C++11 code compiled by GCC 4.8.2, disassembled with GDB on x86_64:
std::atomic<int> a;
int temp = 0;
a.store(temp, std::memory_order_seq_cst);
0x4613e8 <+0x0058> mov 0x38(%rsp),%eax   # load temp
0x4613ec <+0x005c> mov %eax,0x20(%rsp)   # plain store into a
0x4613f0 <+0x0060> mfence                # full barrier after the store
But why does x86/x86_64 implement sequential consistency (SC) with MOV [addr], reg + MFENCE rather than MOV [addr], reg + SFENCE? Why do we need the full fence MFENCE there instead of SFENCE?
From the comments:

"SFENCE cannot provide a total ordering that's observed by all CPUs" - i.e., why do we need the LFENCE part of MFENCE after each store operation (and not before load operations)? - Alex

Suppose X and Y are zero. Now: [Thread 1: STORE X = 1, SFENCE], [Thread 2: STORE Y = 1, SFENCE], and in any other thread do [LFENCE, LOAD X, LOAD Y]. Now one other thread could see X = 1, Y = 0, and another could see X = 0, Y = 1. The fences only tell you that other, earlier stores in Thread 1 have taken effect if you see X = 1. But there's no global order consistent with that. - Kerrek SB
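Kerrek SB's example is the classic IRIW (independent readers of independent writers) litmus test. A runnable C++11 sketch of it (thread structure and names are mine): with seq_cst on every access, the split outcome described in the comment is forbidden, because all SC operations fall into a single total order, which is exactly what a store-side SFENCE alone cannot guarantee.

#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<int> X{0}, Y{0};

int main() {
    int r1, r2, r3, r4;
    std::thread t1([&] { X.store(1, std::memory_order_seq_cst); });
    std::thread t2([&] { Y.store(1, std::memory_order_seq_cst); });
    std::thread t3([&] { r1 = X.load(std::memory_order_seq_cst);
                         r2 = Y.load(std::memory_order_seq_cst); });
    std::thread t4([&] { r3 = Y.load(std::memory_order_seq_cst);
                         r4 = X.load(std::memory_order_seq_cst); });
    t1.join(); t2.join(); t3.join(); t4.join();
    // Forbidden under SC: r1==1 && r2==0 && r3==1 && r4==0,
    // i.e. the two readers disagreeing on the order of the two stores.
    std::printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
}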