
Ok, I have been reading several Qs from SO regarding x86 CPU fences (LFENCE, SFENCE and MFENCE), and I must be honest that I am still not totally sure when a fence is required. I am trying to understand it from the perspective of removing full-blown locks and using more fine-grained locking via fences, to minimise latency.

Firstly here are two specific questions I do not understand:

Sometimes when doing a store, a CPU will write to its store buffer instead of the L1 cache. I do not, however, understand the conditions under which a CPU will do this.

CPU2 may wish to load a value which has been written into CPU1's store buffer. As I understand it, the problem is that CPU2 cannot see the new value in CPU1's store buffer. Why can't the MESI protocol just include flushing store buffers as part of its protocol?

More generally, could somebody please attempt to describe the overall scenario and help explain when LFENCE/MFENCE and SFENCE instructions are required?

NB: One of the problems in reading around this subject is the number of articles written "generally" for multiple CPU architectures, when I am only interested in the Intel x86-64 architecture specifically.

"Why can't the MESI protocol just include flushing store buffers as part of its protocol??" If the store buffers had to have strict ordering with respect to the instruction stream, they would serve no purpose. Without such an ordering, when do you flush them? Essentially, your suggestion is "why don't we slow everything down to inter-core speed rather than requiring people to identify the specific things that need to suffer this penalty?" – David Schwartz
On x86 you pretty much only need to use fencing if you use a memory type other than write-back cached, or if you use non-temporal instructions. See also this answer, and the manual section referenced therein. – Jester
Without any explicit synchronization, CPU2 may see the old value even if the store is already buffered in CPU1's store buffer; there's nothing wrong with that. Only once CPU1 makes the store visible "must" CPU2 see it. – Leeor
There's a related post on the Intel forums that mentions usage of MFENCE: software.intel.com/en-us/forums/… – jrh

1 Answer


The simplest answer: you must use one of the 3 fences (LFENCE, SFENCE, MFENCE) to provide one of the 6 levels of data consistency:

  • Relaxed
  • Consume
  • Acquire
  • Release
  • Acquire-Release
  • Sequential

C++11:

Initially, you should consider this problem from the point of view of the degree of ordering of memory accesses, which is well documented and standardized in C++11. You should read this first: http://en.cppreference.com/w/cpp/atomic/memory_order
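
As a quick reference, here is a minimal sketch (my own illustration, with hypothetical variable names) of how those six orderings map onto std::atomic operations in C++11; note that acq_rel only applies to read-modify-write operations:

    #include <atomic>

    std::atomic<int> x{0};

    void demo() {
        x.store(1, std::memory_order_relaxed);      // Relaxed
        x.store(2, std::memory_order_release);      // Release
        x.store(3, std::memory_order_seq_cst);      // Sequential (the default)

        int a = x.load(std::memory_order_relaxed);  // Relaxed
        int b = x.load(std::memory_order_consume);  // Consume
        int c = x.load(std::memory_order_acquire);  // Acquire
        int d = x.load(std::memory_order_seq_cst);  // Sequential (the default)

        x.exchange(4, std::memory_order_acq_rel);   // Acquire-Release (RMW only)
        (void)a; (void)b; (void)c; (void)d;
    }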

x86/x86_64:

1. Acquire-Release Consistency: It is important to understand that on x86, access to conventional RAM (marked by default as WB - Write Back; the effect is the same with WT (Write Through) or UC (Uncacheable)) by using a plain asm MOV, without any additional instructions, automatically provides the ordering of Acquire-Release Consistency - std::memory_order_acq_rel. I.e. for this memory it only makes sense to use std::memory_order_seq_cst to provide Sequential Consistency. In other words, when you use std::memory_order_relaxed or std::memory_order_acq_rel, the compiled assembler code for std::atomic::store() (or std::atomic::load()) will be the same - just a MOV without any L/S/MFENCE.
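
A minimal sketch of what this means in practice; the assembly in the comments is what GCC/Clang typically emit for x86_64 (an assumption about typical codegen, not guaranteed output):

    #include <atomic>

    std::atomic<int> flag{0};

    void writer_release() {
        flag.store(1, std::memory_order_release);    // mov DWORD PTR flag[rip], 1
    }                                                // (no fence needed)

    int reader_acquire() {
        return flag.load(std::memory_order_acquire); // mov eax, DWORD PTR flag[rip]
    }                                                // (no fence needed)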

Note: But you must know that not only the CPU but also the C++ compiler can reorder operations on memory, and all 6 memory barriers always affect the C++ compiler regardless of the CPU architecture.

Then you must know how this can be compiled from C++ to ASM (native machine code), or how you can write it in assembler. To provide any consistency level except Sequential, you can simply write MOV, for example MOV reg, [addr] and MOV [addr], reg, etc.

2. Sequential Consistency: But to provide Sequential Consistency you must use implicit (LOCK) or explicit fences (L/S/MFENCE) as described here: Why GCC does not use LOAD(without fence) and STORE+SFENCE for Sequential Consistency?

  1. LOAD (without fence) and STORE + MFENCE
  2. LOAD (without fence) and LOCK XCHG
  3. MFENCE + LOAD and STORE (without fence)
  4. LOCK XADD ( 0 ) and STORE (without fence)

For example, GCC uses 1, but MSVC uses 2. (But you must know that MSVS2012 has a bug: Does the semantics of `std::memory_order_acquire` require processor instructions on x86/x86_64?)
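
A minimal sketch of the difference for a seq_cst store; the assembly in the comments corresponds to mappings 1 and 2 above and is only indicative of typical compiler output:

    #include <atomic>

    std::atomic<int> g{0};

    void store_seq_cst() {
        g.store(1, std::memory_order_seq_cst);
        // Mapping 1 (e.g. GCC):   mov DWORD PTR g[rip], 1
        //                         mfence
        // Mapping 2 (e.g. MSVC):  mov eax, 1
        //                         xchg DWORD PTR g[rip], eax   ; implicit LOCK
    }

    int load_seq_cst() {
        return g.load(std::memory_order_seq_cst);    // plain mov in both mappings
    }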

Then, you can read Herb Sutter, your link: https://onedrive.live.com/view.aspx?resid=4E86B0CF20EF15AD!24884&app=WordPdf&authkey=!AMtj_EflYn2507c

Exceptions to the rule:

This rule is true for access by using MOV to conventional RAM marked by default as WB - Write Back. The memory type is marked in the Page Table, in each PTE (Page Table Entry), for each page (4 KB of contiguous memory).

But there are some exceptions:

  1. If we mark memory in the Page Table as Write Combined (e.g. with ioremap_wc() in the Linux kernel), then only Acquire Consistency is provided automatically, and we must act as in the following paragraph.

  2. See answer to my question: https://stackoverflow.com/a/27302931/1558037

  • Writes to memory are not reordered with other writes, with the following exceptions:
    • writes executed with the CLFLUSH instruction;
    • streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD); and
    • string operations (see Section 8.2.4.1).

In both cases 1 & 2 you must use an additional SFENCE between two writes to the same address even if you want Acquire-Release Consistency, because here only Acquire Consistency is provided automatically and you must do the Release (SFENCE) yourself.
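
For the non-temporal (streaming) store case quoted above, here is a minimal sketch using SSE intrinsics (the buffer and flag names are hypothetical): the streaming stores are weakly ordered, so an SFENCE is needed before publishing the data with an ordinary release store:

    #include <atomic>
    #include <immintrin.h>

    int buffer[1024];
    std::atomic<int> ready{0};

    void producer() {
        for (int i = 0; i < 1024; ++i)
            _mm_stream_si32(&buffer[i], i);         // MOVNTI: weakly ordered write

        _mm_sfence();                               // order the streaming stores
        ready.store(1, std::memory_order_release);  // before the release flag
    }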

Answer to your two questions:

Sometimes when doing a store, a CPU will write to its store buffer instead of the L1 cache. I do not, however, understand the conditions under which a CPU will do this.

From the point of view of the user, the L1 cache and the store buffer act differently: L1 is fast, but the store buffer is faster.

  • Store buffer - a simple queue that holds only writes and cannot be reordered - it exists to increase performance and to hide the latency of access to the cache (L1 ~1 ns, L2 ~3 ns, L3 ~10 ns). The CPU core treats a write as already stored to the cache and executes the next instruction, while in reality the write has only been placed in the store buffer and will be written to the L1/L2/L3 cache later; i.e. the CPU core does not need to wait until the write has actually reached the cache.

  • Cache L1/2/3 - behaves like a transparent associative array (address - value). It is fast but not the fastest, because x86 automatically provides Acquire-Release Consistency by using the cache-coherence protocol MESIF/MOESI. This is done to make multithreaded programming simpler, but it decreases performance. (Strictly speaking, we can use write-contention-free algorithms and data structures without cache coherence, i.e. without MESIF/MOESI, for example over PCI Express.) The MESIF/MOESI protocols work over QPI, which connects the cores within a CPU and the cores of different CPUs in multiprocessor systems (ccNUMA).

CPU2 may wish to load a value which has been written into CPU1's store buffer. As I understand it, the problem is that CPU2 cannot see the new value in CPU1's store buffer.

Yes.
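
This is exactly what the classic store-buffer litmus test shows; here is a minimal sketch (thread and variable names are mine) in which both loads may return 0, because each core's store can still be sitting in its own store buffer when the other core loads:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> X{0}, Y{0};
    int r1, r2;

    void cpu1() {
        X.store(1, std::memory_order_relaxed);
        r1 = Y.load(std::memory_order_relaxed);  // may read 0: the store to X is
    }                                            // still in CPU1's store buffer

    void cpu2() {
        Y.store(1, std::memory_order_relaxed);
        r2 = X.load(std::memory_order_relaxed);  // may read 0 as well
    }

    int main() {
        std::thread t1(cpu1), t2(cpu2);
        t1.join(); t2.join();
        std::printf("r1=%d r2=%d\n", r1, r2);    // r1==0 && r2==0 is allowed
    }                                            // without seq_cst / MFENCE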

Why can't the MESI protocol just include flushing store buffers as part of its protocol?

The MESI protocol can't just include flushing store buffers as part of its protocol, because:

  • The MESI/MOESI/MESIF protocols are not related to the store buffer and do not know about it.
  • Automatically flushing the store buffer on every write would decrease performance - and would make the store buffer useless.
  • Manually flushing the store buffers on all remote CPU cores (we don't know which core's store buffer contains the required write) by using some command would decrease performance (8 CPUs x 15 cores = 120 cores flushing their store buffers at the same time - this is terrible).

But manually flushing the store buffer on the current CPU core - yes, you can do that by executing the SFENCE instruction. You can use SFENCE in two cases:

  • To provide Sequential Consistency on RAM with Write-Back caching
  • To provide Acquire-Release Consistency for the exceptions to the rule: RAM marked as Write Combined, writes executed with the CLFLUSH instruction, and non-temporal SSE/AVX stores

Note:

Do we need LFENCE in any case on x86/x86_64? The answer is not always clear: Does it make any sense to use the LFENCE instruction on x86/x86_64 processors?

Other platforms:

Then, you can read how, in theory (for a spherical processor in a vacuum), the store buffer and the invalidate queue work, in your link: http://www.puppetmastertrading.com/images/hwViewForSwHackers.pdf

And how you can provide Sequential Consistency on other platforms, not only with L/S/MFENCE and LOCK but also with LL/SC: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html