1
votes

Based on what I have read, a CPU can reorder the execution of instructions, and a memory barrier prevents the reordering of instructions from before to after and from after to before the barrier.

But there is something that I am not sure of. Say I have the following instructions:

store x
store y

Let's say that the CPU decided to execute store y before store x.

How does the CPU do that? Does it completely ignore store x and execute store y first? Or does the following happen?:

  1. store x is executed, but it is not completed immediately (it becomes pending).
  2. store y is executed, and it is completed immediately.
  3. The pending store x is completed.

So basically, this gives the "illusion" that the instructions were executed out of order, even though they weren't; they only completed out of order.
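
To make the question concrete, here is a minimal sketch (my own C++11 example, not taken from any manual; the thread structure and variable names are mine) of how such reordering could be observed from software. With memory_order_relaxed, neither the compiler nor a weakly ordered CPU such as ARM has to keep the two stores in order; x86 hardware itself does keep store order, as the comments point out:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};

void writer() {
    x.store(1, std::memory_order_relaxed);  // "store x"
    y.store(1, std::memory_order_relaxed);  // "store y"
}

void reader() {
    while (y.load(std::memory_order_relaxed) == 0) { }  // spin until store y is visible
    // If the stores became visible out of order, this can still print 0
    // (possible on weakly ordered CPUs; x86 keeps store order in hardware).
    std::printf("x = %d\n", x.load(std::memory_order_relaxed));
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join();
    t2.join();
}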


I am asking this question to understand how a memory barrier works.

For example say I have the following instructions:

store x
mfence
store y

Now when the CPU executes these instructions, will the following happen?:

  1. store x is executed, but it is not completed immediately (it becomes pending).
  2. mfence is executed; since this instruction is a memory barrier, the CPU will make sure that all pending operations before it (store x) are completed before it continues executing instructions (see the sketch after this list).
  3. store y is executed.
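
For reference, here is a minimal sketch of how that exact store x / mfence / store y sequence could be produced from C++ (this is my illustration; the _mm_mfence intrinsic is the SSE2 one from <emmintrin.h>, and x and y are just the example variables from above):

#include <emmintrin.h>   // _mm_mfence

volatile int x, y;

void ordered_stores() {
    x = 1;           // store x
    _mm_mfence();    // mfence: store x must be globally visible before store y
    y = 1;           // store y
}
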
The sole point of out-of-order execution is to actually execute out of order, and the illusion is that they executed in order. Note that there are rules, one of them being: writes to memory are not reordered with other writes (assuming WB memory and no fancy stuff like explicit non-temporal stores). - Jester
With all of the peripherals that require setup before a write to, say, a go/enable/run register, out-of-order writes would be a disaster. - old_timer
Think a=b+c; d=e+f; h=5; g=a+d. The d= could happen before the a= and everything would be fine, or perhaps the h=5 could be moved around. If some register is busy and there is something else that isn't, something that can cut in line without changing the functionality of the program, then run that. - old_timer
Don't know about x86, but on ARM you use memory barriers for things like flushing the write buffer or invalidating the cache: before letting anyone else perform any memory operations, invalidate the cache and basically finish any pending memory transactions. A data barrier basically says finish any data transactions in flight or in the queue, and an instruction barrier says finish out the pipe before moving on. - old_timer
Search through open source projects (like Linux) and see where they use a memory barrier, and where they don't... It should start to shed light on your confusion. - old_timer

2 Answers

3
votes

mfence does not prevent out-of-order execution.
It merely ensures that all memory loads and stores preceding the mfence are serialized (made globally visible) before any memory loads or stores after the mfence.

See: http://x86.renejeschke.de/html/file_module_x86_id_170.html

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes in program order the MFENCE instruction is globally visible before any load or store instruction that follows the MFENCE instruction is globally visible.

x86 is limited in OoO memory accesses in any case
The x86 architecture has a few memory-ordering rules built in already.
The gist of this is that memory accesses receive very little reordering; the main reordering that is allowed (loads passing older stores) is illustrated in the sketch after the list below.

Here's the official write-up from Intel: http://www.cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdf

The gist has been most helpfully listed in the index :-)

Memory ordering for write-back (WB) memory
* Loads are not reordered with other loads and stores are not reordered with other stores
* Stores are not reordered with older loads
* Loads may be reordered with older stores to different locations
[...]
* Loads and stores are not reordered with locks
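
To illustrate the one reordering x86 does allow (a load passing an older store to a different location), here is a hedged C++11 sketch; the program and variable names are mine, not from the Intel paper. Without the commented-out fences, both threads can end up with r1 == 0 and r2 == 0 on real x86 hardware; with the fences, the compiler typically emits an mfence and that outcome is no longer possible:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_relaxed);
    // std::atomic_thread_fence(std::memory_order_seq_cst);  // typically compiles to mfence
    r1 = Y.load(std::memory_order_relaxed);   // may be reordered with the older store to X
}

void thread2() {
    Y.store(1, std::memory_order_relaxed);
    // std::atomic_thread_fence(std::memory_order_seq_cst);  // typically compiles to mfence
    r2 = X.load(std::memory_order_relaxed);   // may be reordered with the older store to Y
}

int main() {
    std::thread a(thread1), b(thread2);
    a.join();
    b.join();
    std::printf("r1=%d r2=%d\n", r1, r2);  // r1==0 && r2==0 is a legal outcome without the fences
}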

Back to your questions

Does the CPU actually execute one instruction before the other when reordering?
Yes, you can see this when timing the code.

Let me give you an example. Let's assume we have an AMD Jaguar, which can execute 2 instructions in parallel and has full OoO.

a: mov ebx,[eax]      //1 cycle throughput
b: mov ecx,2          //pairs
c: imul eax,edx       //3 cycles latency
d: add eax,ebp        //1 cycle, needs to wait for c

Normally this snippet would take 1+3+1 = 5 cycles. However, the CPU will execute this in the following order:

c: imul eax,edx      //3 cycle latency
a: mov ebx,[eax']    //pairs, eax is renamed to eax' in the register rename buffer
b: mov ecx,2         //1 cycle
d: add eax,ebp       //1 cycle waits for c

This only takes 4 cycles: 3 for c and 1 for d; all the rest gets interleaved.
There obviously is space to squeeze more instructions between c and d and the CPU will do so if it has any instructions that are applicable.
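
If you want to observe the effect, here is a rough sketch (mine; the constants are arbitrary and the exact speedup depends on the CPU) that times one serial multiply chain against the same work split into two independent chains. On an out-of-order core the second version usually runs close to twice as fast, because the two chains overlap the multiply latency (compile without -ffast-math so the compiler cannot reassociate the serial chain):

#include <chrono>
#include <cstdio>

// __attribute__((noinline)) is a GCC/Clang attribute, used here only to keep
// each loop in its own out-of-line function.
__attribute__((noinline)) double chain(long n) {
    double a = 1.0;
    for (long i = 0; i < n; ++i)
        a *= 1.0000001;              // every multiply depends on the previous one
    return a;
}

__attribute__((noinline)) double split(long n) {
    double a = 1.0, b = 1.0;
    for (long i = 0; i < n; i += 2) {
        a *= 1.0000001;              // two independent chains:
        b *= 1.0000001;              // the OoO core can run them in parallel
    }
    return a * b;
}

int main() {
    const long n = 200000000;
    auto time = [](double (*f)(long), long iters) {
        auto t0 = std::chrono::steady_clock::now();
        double r = f(iters);
        auto t1 = std::chrono::steady_clock::now();
        long long ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
        std::printf("result %f, %lld ms\n", r, ms);
    };
    time(chain, n);   // limited by multiply latency
    time(split, n);   // same number of multiplies, chains overlap
}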

Note that the CPU reorders a memory load as long as it is not moved relative to another memory load (and subject to a few other restrictions, see above).
Also note that AMD and Intel follow the exact same semantics.

0
votes

On a super-scalar processor, you can have operations queued up waiting for previous instructions to complete. Imagine code like this:

...
div %esi        # divide edx:eax by esi
mov %eax,(%ebx) # store quotient in (%ebx)
movl $1,(%ecx)  # store 1 in (%ecx)

On a super-scalar processor, the first mov instruction is encountered right after the div instruction is dispatched. However, at that time div hasn't finished yet, so that store is queued in the instruction queue until the result of div %esi is available in %eax. In the next cycle, the processor encounters movl $1,(%ecx). Since the immediate $1 is available right away, the processor doesn't have to wait and can execute that store immediately. Some time after the second store has been dispatched, the div instruction finishes, causing the first store to be released from the instruction queue and executed.

This is how it happens that stores occur in a different order than the machine code specifies. The CPU has extra logic to ensure that this detail isn't usually visible to the programmer, but depending on what architecture you program for, different artifacts can exist.
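
To tie this back to the original question, here is a toy model in C++ (purely illustrative, nothing like real hardware structures) of the store queue behaviour just described: the store that depends on the div result sits in the queue, while the later store whose data is an immediate completes right away:

#include <cstdio>
#include <optional>
#include <vector>

struct PendingStore {
    const char*        dest;   // destination, e.g. "(%ebx)"
    std::optional<int> data;   // empty until the producing instruction finishes
};

int main() {
    std::vector<PendingStore> queue;
    queue.push_back({"(%ebx)", std::nullopt});  // mov %eax,(%ebx): waits for the div
    queue.push_back({"(%ecx)", 1});             // movl $1,(%ecx): data available now

    // Drain whatever is ready: the second store reaches memory first.
    for (const auto& s : queue)
        if (s.data)
            std::printf("store %d -> %s\n", *s.data, s.dest);

    // Later the div finishes, the first store's data arrives, and it completes.
    queue[0].data = 42;  // hypothetical quotient
    std::printf("store %d -> %s\n", *queue[0].data, queue[0].dest);
}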