13
votes

I have read that some CPUs reorder instructions, but this is not a problem for single threaded programs (the instructions would still be reordered in single threaded programs, but it would appear as if the instructions were executed in order), it is only a problem for multithreaded programs.

To solve the problem of instructions reordering, we can insert memory barriers in the appropriate places in the code.

But does an x86 CPU reorder instructions? If it does not, then there is no need to use memory barriers, right?

1
modern x86 does not only reorder instructions, but translates them into micro-instructions. You need memory barriers when MT even when there's is no instruction reordering, if the writes to memory are not guaranteed to land in the original order, i.e. it depends not only on out-of-order execution of instructions, but also on the memory model, the memory model may be weak enough to reshuffle order of memory changes appearing to other cores. (IIRC the x86 has very "strong" memory model, resolving many of those complexities for programmer, but then x86 is reordering, so you still need barriers).Ped7g
Memory reordering is independent of out-of-order execution. An in-order CPU starts instruction in-order, but they can still complete out of order, and stores are buffered. See preshing.com/20120515/memory-reordering-caught-in-the-act for when you need mfence on x86: only to prevent StoreLoad reordering; AFAIK you still need mfence there on in-order Atom or Pentium CPUs. (But all modern x86 CPUs have fully out-of-order execution.)Peter Cordes
@BeeOnRope: You're right that it's a bit of an overstatement. I should have said that memory-reordering can happen without OoO exec. But really, checking whether a CPU does out-of-order exec is the wrong thing to ask for figuring out where / when you need memory barriers. x86's strong memory model means you don't need barriers in some cases, even with aggressive OoO exec, so again you need to know the memory model, not the exec model.Peter Cordes
Yes, agree 100%. In fact I just realized the original version of my answer was wrong since it read as "Yes, x86 reorders instructions so yes you need memory barriers. ". That's wrong (the so part) and I think what you are getting at above. I changed it so it's like more independent now :). I actually agree they are independent mostly at the ISA/documentation level, but heavily linked at the CPU design uarch level (but OoO reordering isn't the only reason for memory reordering as you point out). @peterBeeOnRope
Now I want to use "independent" in my answer. There has to be a better word which means "is not implied by (or vice-versa), but may be correlated with...".BeeOnRope

1 Answers

28
votes

Reordering

Yes, all modern x86 chips from Intel and AMD aggressively reorder instructions across a window which is around 200 instructions deep on recent CPUs from both manufacturers (i.e. a new instruction may execute while an older instruction more than 200 instructions "in the past" is still waiting). This is generally all invisible to a single thread since the CPU still maintains the illusion of serial execution1 by the current thread by respecting dependencies, so from the point of view of the current thread of execution it is as-if the instructions were executed serially.

Memory Barriers

That should answer the titular question, but then your second question is about memory barriers. It contains, however, an incorrect assumption that instruction reordering necessarily causes (and is the only cause of) visible memory reordering. In fact, instruction reordering is neither sufficient nor necessary for cross-thread memory re-ordering.

Now it is definitely true that out-of-order execution is a primary driver of out-of-order memory access capabilities, or perhaps it is the quest for MLP (Memory Level Parallelism) that drives the increasingly powerful out-of-order abilities for modern CPUs. In fact, both are probably true at once: increasing out-of-order capabilities benefit a lot from strong memory reordering capabilities, and at the same time aggressive memory reordering and overlapping isn't possible without good out-of-order capabilities, so they help each other in kind of a self-reinforcing sum-greater-than-parts kind of loop.

So yes, out-of-order execution and memory reordering certainly have a relationship; however, you can easily get re-ordering without out-of-order execution! For example, a core-local store buffer often causes apparent reordering: at the point of execution the store isn't written directly to the cache (and hence isn't visible at the coherency point), which delays local stores with respect to local loads which need to read their values at the point of execution.

As Peter also points out in the comment thread you can also get a type of load-load reordering when loads are allowed to overlap in an in-order design: load 1 may start but in the absence of an instruction consuming its result a pipelined in-order design may proceed to the following instructions which might include another load 2. If load 2 is a cache hit and load 1 was a cache miss, load 2 might be satisfied earlier in time from load 1 and hence the apparent order may be swapped re-ordered.

So we see that not all cross-thread memory re-ordering is caused by instruction re-ordering, but certain instruction re-ordering also implies out-of-order memory access, right? No so fast! There are two different contexts here: what happens at the hardware level (i.e., whether memory access instructions can, as a practical matter, execute out-of-order), and what is guaranteed by the ISA and platform documentation (often called the memory model applicable to the hardware).

x86 re-ordering

In the case of x86, for example, modern chips will freely re-order more or less any stream of loads and stores with respect to each other: if a load or store is ready to execute, the CPU will usually attempt it, despite the existence of earlier uncompleted load and store operations.

At the same time, x86 defines quite a strict memory model, which bans most possible reorderings, roughly summarized as follows:

  • Stores have a single global order of visibility, observed consistently by all CPUs, subject to one loosening of this rule below.
  • Local load operations are never reordered with respect to other local load operations.
  • Local store operations are never reordered with respect to other local store operations (i.e., a store that appears earlier in the instruction stream always appears earlier in the global order).
  • Local load operations may be reordered with respect to earlier local store operations, such that the load appears to execute earlier wrt the global store order than the local store, but the reverse (earlier load, older store) is not true.

So actually most memory re-orderings are not allowed: loads with respect to each outer, stores with respect to each other, and loads with respect to later stores. Yet I said above that x86 pretty much freely executes out-of-order all memory access instructions - how can you reconcile these two facts?

Well, x86 does a bunch of extra work to track exactly the original order of loads and stores, and makes sure no memory re-orderings that breaks the rules is ever visible. For example, let's say load 2 executes before load 1 (load 1 appears earlier in program order), but that both involved cache lines were in the "exclusively owned" state during the period that load 1 and load 2 executed: there has been reordering, but the local core knows that it cannot be observed because no other was able to peek into this local operation.

In concert with the above optimizations, CPUs also uses speculative execution: execute everything out of order, even if it is possible that at some later point some core can observe the difference, but don't actually commit the instructions until such an observation is impossible. If such an observation does occur, you roll back the CPU to an earlier state and try again. This is cause of the "memory ordering machine clear" on Intel.

So it is possible to define an ISA that doesn't allow any re-ordering at all, but under the covers do re-ordering but carefully check that it isn't observed. PA-RISC is an example of such a sequentially consistent architecture. Intel has a strong memory model that allows one type of reordering, but disallows many others, but each chip internally may do more (or less) re-ordering as long as they can guarantee to play by the rules in an observable sense (in this sense, it somewhat related to the "as-if" rule that compilers play by when it comes to optimizations).

The upshot of all that is that yes, x86 requires memory barriers to prevent specifically the so-called StoreLoad re-ordering (for algorithms that require this guarantee). You don't find many standalone memory barriers in practice in x86, because most concurrent algorithms also need atomic operations, such as atomic add, test-and-set or compare-and-exchange, and on x86 those all come with full barriers for free. So the use of explicit memory barrier instructions like mfence is limited to cases where you aren't also doing an atomic read-modify-write operation.

Jeff Preshing's Memory Reordering Caught in the Act has one example that does show memory reordering on real x86 CPUs, and that mfence prevents it.


1 Of course if you try hard enough, such reordering is visible! An high-impact recent example of that would be the Spectre and Meltdown exploits which exploited speculative out-of-order execution and a cache side channel to violate memory protection security boundaries.