The key point is the adverb "locally" in the quoted sentence "It does not execute until all prior instructions have completed locally".
I was unable to find a clear definition of "complete locally" anywhere in the Intel manuals, so what follows is my speculation.
In order to be completed locally, an instruction must have its output computed and available to the other instructions further down its dependency chain.
Furthermore any side effect of that instruction must be visible inside the core.
In order to be completed globally an instruction must have its side effects visible to other system components (like other CPUs).
If we don't qualify the kind of "completeness" we are talking about, it generally means either that we don't care or that it is implicit in the context.
For a lot of instructions, being completed locally and being completed globally are the same thing.
For a load for example, in order to be completed locally, some data must be fetched from memory or caches.
This is the same as being completed globally, since we cannot mark the load complete if we don't read from the memory hierarchy first.
For a store however the situation is different.
Intel processors have a Store Buffer to handle writes to memory. From Chapter 11.10 of Manual 3:
Intel 64 and IA-32 processors temporarily store each write (store) to memory in a store buffer. The store buffer
improves processor performance by allowing the processor to continue executing instructions without having to
wait until a write to memory and/or to a cache is complete. It also allows writes to be delayed for more efficient use
of memory-access bus cycles.
So a store can be completed locally by being put in the store buffer; from the core's perspective, it is as if the write had gone all the way to memory.
Under specific circumstances, a later load on the same core as the store can even read that value back (this is called Store Forwarding).
To be completed globally, however, a store needs to be drained from the Store Buffer.
Finally, it must be added that the Store Buffer is drained by serializing instructions:
The contents of the store buffer are always drained to memory in the following situations:
• (P6 and more recent processor families only) When a serializing instruction is executed.
• (Pentium III, and more recent processor families only) When using an SFENCE instruction to order stores.
• (Pentium 4 and more recent processor families only) When using an MFENCE instruction to order stores.
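To make the local/global distinction concrete, here is a minimal sketch of the classic "store buffering" test, assuming GCC or Clang on x86 with POSIX threads. The names `x`, `y`, `r1`, `r2`, `writer_x`, `writer_y` and `both_saw_zero` are made up for illustration. Each thread stores to one location, executes `mfence` so that its store is drained from the store buffer, and only then loads the other location.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <x86intrin.h>   /* _mm_mfence; assumes GCC or Clang on x86 */

static atomic_int x, y;  /* shared locations (hypothetical names) */
static int r1, r2;       /* what each thread observed */

static void *writer_x(void *arg) {
    (void)arg;
    /* This store may complete locally by sitting in the store buffer. */
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    /* mfence: the store above is drained before the load below runs. */
    _mm_mfence();
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *writer_y(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    _mm_mfence();
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

/* Runs both threads once; returns 1 if both loads observed 0. */
int both_saw_zero(void) {
    pthread_t a, b;
    int rc;
    atomic_store(&x, 0);
    atomic_store(&y, 0);
    rc = pthread_create(&a, NULL, writer_x, NULL);
    assert(rc == 0);
    rc = pthread_create(&b, NULL, writer_y, NULL);
    assert(rc == 0);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return r1 == 0 && r2 == 0;
}
```

With the two `_mm_mfence()` calls in place, `both_saw_zero()` should never return 1. If they are removed, both loads may return 0 on x86, because each store can still be waiting in its own core's store buffer (completed locally, not globally) when the other core's load executes.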
Being done with the introduction, let's see what lfence, mfence and sfence do:
LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes.
MFENCE performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction.
MFENCE does not serialize the instruction stream.
SFENCE performs a serializing operation on all store-to-memory instructions that were issued prior the SFENCE instruction.
So lfence is a weaker form of serialization that doesn't drain the Store Buffer. Since it effectively serializes instructions locally, all loads before it must complete before it does.
sfence serializes stores only: it basically doesn't allow the processor to execute any more stores until sfence is retired. It also drains the Store Buffer.
mfence is not a simple combination of the two, because it is not serializing in the classical sense: it is an sfence that also prevents later loads from being executed.
It may be worth noting that sfence was introduced first, and the other two came later to achieve more granular control over memory ordering.
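As an illustration of where sfence typically shows up in practice, here is a minimal sketch assuming GCC or Clang on x86-64. The names `payload`, `ready`, `publish` and `consume_sum` are hypothetical. Non-temporal (streaming) stores are weakly ordered, so the producer issues `sfence` before raising a flag that tells a consumer the data is there.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <x86intrin.h>   /* _mm_stream_si32, _mm_sfence, _mm_pause */

#define PAYLOAD_LEN 16

static int payload[PAYLOAD_LEN];  /* data being published (hypothetical) */
static atomic_int ready;          /* flag the consumer polls */

/* Producer: write the data with streaming stores, then raise the flag. */
void publish(const int *src, size_t n) {
    assert(n <= PAYLOAD_LEN);
    for (size_t i = 0; i < n; i++)
        _mm_stream_si32(&payload[i], src[i]);  /* weakly ordered store */
    _mm_sfence();  /* order the streaming stores before the flag store */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: wait for the flag, then read the data. */
int consume_sum(size_t n) {
    int sum = 0;
    assert(n <= PAYLOAD_LEN);
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        _mm_pause();  /* spin-wait hint */
    for (size_t i = 0; i < n; i++)
        sum += payload[i];
    return sum;
}
```

In real use `publish` and `consume_sum` would run on different cores; calling them one after the other on a single thread only checks that the data round-trips. For ordinary (non-streaming) stores to write-back memory, x86 already keeps stores in program order, so this sketch is about the streaming-store case specifically.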
Finally, I used to enclose an rdtsc instruction between two lfence instructions, to be sure that no reordering "backward" or "forward" was possible.
However, I'm sure about this technique's soundness.
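For reference, the rdtsc-between-two-lfence pattern can be sketched as follows, assuming GCC or Clang on x86 and the `__rdtsc` and `_mm_lfence` intrinsics from `<x86intrin.h>`. The helper names `fenced_rdtsc` and `time_sum` are made up for illustration, and this is only a sketch of the pattern, not a statement about whether it is sufficient on every microarchitecture.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence */

/* Read the time-stamp counter with an lfence on each side. */
static inline uint64_t fenced_rdtsc(void) {
    uint64_t t;
    _mm_lfence();   /* earlier instructions complete locally first */
    t = __rdtsc();
    _mm_lfence();   /* later instructions do not start before this */
    return t;
}

/* Sums v[0..n-1] into *sum_out and returns the elapsed TSC ticks. */
uint64_t time_sum(const uint32_t *v, size_t n, uint64_t *sum_out) {
    uint64_t start, end, sum = 0;
    assert(v != NULL && sum_out != NULL);
    start = fenced_rdtsc();
    for (size_t i = 0; i < n; i++)
        sum += v[i];
    *sum_out = sum;             /* store the result inside the timed region */
    end = fenced_rdtsc();
    return end - start;
}
```

The returned value is in TSC ticks, which are not necessarily core clock cycles, and a single measurement of a region this small is dominated by the overhead of the fences themselves.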