1
votes

Based on what I have read, a CPU can reorder the execution of instructions, and a memory barrier prevents the reordering of instructions from before to after and from after to before the barrier.

But there is something that I am not sure of. Say I have the following instructions:

store x
store y

Let's say that the CPU decided to execute store y before store x.

How does the CPU do that? Does it completely ignore store x and execute store y first? Or does the following happen?:

  1. store x is executed, but it is not completed immediately (it becomes pending).
  2. store y is executed, and it is completed immediately.
  3. The pending store x is completed.

So basically, this gives the "illusion" that the instructions were executed out of order, even though they weren't; they only completed out of order.
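
To make the question concrete, here is a minimal sketch (my own C++11 example, not taken from any manual; the thread structure and variable names are mine) of how such reordering could be observed from software. With memory_order_relaxed, neither the compiler nor a weakly ordered CPU such as ARM has to keep the two stores in order; x86 hardware itself does keep store order, as the comments point out:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

void writer() {
    x.store(1, std::memory_order_relaxed);  // "store x"
    y.store(1, std::memory_order_relaxed);  // "store y"
}

void reader() {
    while (y.load(std::memory_order_relaxed) == 0) { }  // spin until store y is visible
    // If the stores became visible out of order, this can still print 0
    // (possible on weakly ordered CPUs; x86 keeps store order in hardware).
    std::printf("x = %d\n", x.load(std::memory_order_relaxed));
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}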


I am asking this question to understand how a memory barrier works.

For example say I have the following instructions:

store x
mfence
store y

Now when the CPU executes these instructions, will the following happen?:

  1. store x is executed, but it is not completed immediately (it becomes pending).
  2. mfence is executed; since this instruction is a memory barrier, the CPU will make sure that all pending operations before it (store x) are completed before it continues executing instructions (see the sketch after this list).
  3. store y is executed.
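
For reference, here is a minimal sketch of how that exact store x / mfence / store y sequence could be produced from C++ (this is my illustration; the _mm_mfence intrinsic is the SSE2 one from <emmintrin.h>, and x and y are just the example variables from above):

#include <emmintrin.h>   // _mm_mfence

volatile int x, y;

void ordered_stores() {
    x = 1;           // store x
    _mm_mfence();    // mfence: store x must be globally visible before store y
    y = 1;           // store y
}
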
The sole point of out-of-order execution is to actually execute out of order, and the illusion is that they executed in order. Note that there are rules, one of them being: writes to memory are not reordered with other writes (assuming WB memory and no fancy stuff like explicit non-temporal stores). - Jester
With all of the peripherals that require setup before a write to, say, a go/enable/run register, out-of-order writes would be a disaster. - old_timer
Think a=b+c; d=e+f; h=5; g=a+d. The d= could happen before the a= and everything would be fine, or perhaps the h=5 could be moved around. If some register is busy and there is something else that isn't, something that can cut in line without changing the functionality of the program, then run that. - old_timer
Don't know about x86, but on ARM you use memory barriers for things like flushing the write buffer or invalidating the cache: before letting anyone else perform any memory operations, invalidate the cache and basically finish any pending memory transactions. A data barrier basically says finish any data transactions in flight or in the queue, and an instruction barrier says finish out the pipe before moving on. - old_timer
Search through open source projects (like Linux) and see where they use a memory barrier, and where they don't... It should start to shed light on your confusion. - old_timer

2 Answers

3
votes

mfence does not prevent out-of-order execution.
It merely ensures that all memory loads and stores preceding the mfence are serialized (made globally visible) before any memory loads or stores after the mfence.

See: http://x86.renejeschke.de/html/file_module_x86_id_170.html

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible.

x86 is limited in OoO memory accesses in any case
The x86 architecture has a few memory-ordering rules built in already.
The gist of this is that memory accesses receive very little reordering; the main reordering that is allowed (loads passing older stores) is illustrated in the sketch after the list below.

Here's the official write-up from Intel: http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf

The gist has been most helpfully listed in the index :-)

Memory ordering for write-back (WB) memory
* Loads are not reordered with other loads and stores are not reordered with other stores
* Stores are not reordered with older loads
* Loads may be reordered with older stores to different locations
[...]
* Loads and stores are not reordered with locks
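
To illustrate the one reordering x86 does allow (a load passing an older store to a different location), here is a hedged C++11 sketch; the program and variable names are mine, not from the Intel paper. Without the commented-out fences, both threads can end up with r1 == 0 and r2 == 0 on real x86 hardware; with the fences, the compiler typically emits an mfence and that outcome is no longer possible:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_relaxed);
    // std::atomic_thread_fence(std::memory_order_seq_cst);  // typically compiles to mfence
    r1 = Y.load(std::memory_order_relaxed);   // may be reordered with the older store to X
}

void thread2() {
    Y.store(1, std::memory_order_relaxed);
    // std::atomic_thread_fence(std::memory_order_seq_cst);  // typically compiles to mfence
    r2 = X.load(std::memory_order_relaxed);   // may be reordered with the older store to Y
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    std::printf("r1=%d r2=%d\n", r1, r2);  // r1==0 && r2==0 is a legal outcome without the fences
}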

Back to your questions

Does the CPU actually execute one instruction before the other when reordering?
Yes, you can see this when timing the code.

Let me give you an example. Let's assume we have an AMD Jaguar, which can execute 2 instructions in parallel and has full OoO.

a: mov ebx,[eax]      //1 cycle throughput
b: mov ecx,2          //pairs
c: imul eax,edx       //3 cycles latency
d: add eax,ebp        //1 cycle, needs to wait for c

Normally this snippet would take 1+3+1 = 5 cycles. However, the CPU will execute this in the following order:

c: imul eax,edx      //3 cycle latency
a: mov ebx,[eax']    //pairs, eax is renamed to eax' in the register rename buffer
b: mov ecx,2         //1 cycle
d: add eax,ebp       //1 cycle waits for c

This only takes 4 cycles: 3 for c and 1 for d; all the rest gets interleaved.
There obviously is space to squeeze more instructions between c and d and the CPU will do so if it has any instructions that are applicable.
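
If you want to observe the effect, here is a rough sketch (mine; the constants are arbitrary and the exact speedup depends on the CPU) that times one serial multiply chain against the same work split into two independent chains. On an out-of-order core the second version usually runs close to twice as fast, because the two chains overlap the multiply latency (compile without -ffast-math so the compiler cannot reassociate the serial chain):

#include <chrono>
#include <cstdio>

// __attribute__((noinline)) is a GCC/Clang attribute, used here only to keep
// each loop in its own out-of-line function.
__attribute__((noinline)) double chain(long n) {
    double a = 1.0;
    for (long i = 0; i < n; ++i)
        a *= 1.0000001;              // every multiply depends on the previous one
    return a;
}

__attribute__((noinline)) double split(long n) {
    double a = 1.0, b = 1.0;
    for (long i = 0; i < n; i += 2) {
        a *= 1.0000001;              // two independent chains:
        b *= 1.0000001;              // the OoO core can run them in parallel
    }
    return a * b;
}

int main() {
    const long n = 200000000;
    auto time = [](double (*f)(long), long iters) {
        auto t0 = std::chrono::steady_clock::now();
        double r = f(iters);
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("result %f, %lld ms\n", r, ms);
    };
    time(chain, n);   // limited by multiply latency
    time(split, n);   // same number of multiplies, chains overlap
}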

Note that the CPU reorders a memory load as long as it is not moved relative to another memory load (and subject to a few other restrictions, see above).
Also note that AMD and Intel follow the exact same semantics.

0
votes

On a super-scalar processor, you can have operations queued up waiting for previous instructions to complete. Imagine code like this:

...
div %esi        # divide edx:eax by esi
mov %eax,(%ebx) # store quotient in (%ebx)
movl $1,(%ecx)  # store 1 in (%ecx)

On a super-scalar processor, the first mov instruction is encountered right after the div instruction is dispatched. However, at that time div hasn't finished yet, so that store is queued in the instruction queue until the result of div %esi is available in %eax. In the next cycle, the processor encounters movl $1,(%ecx). Since the immediate $1 is available right away, the processor doesn't have to wait and can execute that store immediately. Some time after the second store has been dispatched, the div instruction finishes, causing the first store to be released from the instruction queue and executed.

This is how it happens that stores occur in a different order than the machine code specifies. The CPU has extra logic to ensure that this detail isn't usually visible to the programmer, but depending on what architecture you program for, different artifacts can exist.
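
To tie this back to the original question, here is a toy model in C++ (purely illustrative, nothing like real hardware structures) of the store queue behaviour just described: the store that depends on the div result sits in the queue, while the later store whose data is an immediate completes right away:

#include <cstdio>
#include <optional>
#include <vector>

struct PendingStore {
    const char*        dest;   // destination, e.g. "(%ebx)"
    std::optional<int> data;   // empty until the producing instruction finishes
};

int main() {
    std::vector<PendingStore> queue;
    queue.push_back({"(%ebx)", std::nullopt});  // mov %eax,(%ebx): waits for the div
    queue.push_back({"(%ecx)", 1});             // movl $1,(%ecx): data available now

    // Drain whatever is ready: the second store reaches memory first.
    for (const auto& s : queue)
        if (s.data)
            std::printf("store %d -> %s\n", *s.data, s.dest);

    // Later the div finishes, the first store's data arrives, and it completes.
    queue[0].data = 42;  // hypothetical quotient
    std::printf("store %d -> %s\n", *queue[0].data, queue[0].dest);
}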