Can memory stores really be reordered in an OoOE processor?

Question

We know that two instructions can be reordered by an OoOE processor. For example, there are two global variables shared among different threads.

int data;
bool ready;

A writer thread produces the data and then sets the flag ready to allow readers to consume that data.

data = 6;
ready = true;

Now, on an OoOE processor, these two instructions can be reordered (instruction fetch, execution). But what about the final commit/write-back of the results? I.e., will the stores be committed in order?

From what I have learned, this depends entirely on the processor's memory model. E.g., x86/64 has a strong memory model, and reordering of stores is disallowed. In contrast, ARM typically has a weak model where store reordering can happen (along with several other reorderings).

Also, my gut feeling tells me that I am right, because otherwise we wouldn't need a store barrier between those two instructions, as used in typical multi-threaded programs.

But here is what Wikipedia says:

.. In the outline above, the OoOE processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data.

OoOE processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal.

I'm confused. Is it saying that the results have to be written back in order? In an OoOE processor, can the stores to data and ready really be reordered?

Well, it depends on your settings in the page table. For MMIO it would be fatal if the stores were reordered. For simple accesses to the stack it can reorder the writes, simply by not flushing the L1 cache. The OS won't notice anything of course. — kay
It's whatever the machine documentation says it is (assuming the documentation is correct). Out-of-order stores are possible on many architectures. And this is even before you take into account cache-to-memory pushing and cache synchronization between CPUs. — Hot Licks
@Kay Your point regarding MMIO is correct. Reordering the writes can have an observable behavior to other threads, which programmers are really concerned about. — Eric Z
@HotLicks Yes, theoretically problems can occur even on machines w/o any cache. That's why cache coherence is often separated when we talk about memory model. — Eric Z

Brian · Accepted Answer · 2014-08-15T18:08:51

The consistency model (or memory model) for the architecture determines which memory operations can be reordered. The idea is always to achieve the best performance from the code while preserving the semantics expected by the programmer. That is the point of the Wikipedia passage: the memory operations appear in order to the programmer, even though they may have been reordered. Reordering is generally safe when the code is single-threaded, as the processor can easily detect potential violations.

On x86, the common model is that writes are not reordered with other writes. Yet the processor uses out-of-order execution (OoOE), so instructions are being reordered constantly. Generally, the processor has several additional hardware structures to support OoOE, like a reorder buffer and a load-store queue. The reorder buffer ensures that all instructions appear to execute in order, so that interrupts and exceptions break at a specific point in the program. The load-store queue functions similarly, in that it can restore the order of memory operations according to the memory model. The load-store queue also disambiguates addresses, so that the processor can identify whether operations are made to the same or to different addresses.

Back to OoOE: the processor has tens to hundreds of instructions in flight at any time. Loads and stores are computing their addresses, etc. The processor may prefetch the cache lines for the accesses (which may involve cache coherence traffic), but it cannot actually access a line, either to read or to write, until it is safe (according to the memory model) to do so.

Inserting store barriers, memory fences, etc. tells both the compiler and the processor about further restrictions on reordering memory operations. The compiler is part of implementing the memory model: some languages, like Java, define a specific memory model, while others, like C, follow the rule that "memory accesses should appear as if they were executed in order".
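As a rough illustration of such a barrier, here is a minimal C++11 sketch of the question's writer and reader with explicit fences. It assumes the flag is changed to std::atomic<bool> so that the spin loop is well-defined; writer, reader and run_fence_demo are hypothetical names, not anything from the question.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int data = 0;                    // plain payload, as in the question
std::atomic<bool> ready{false};  // assumption: flag made atomic for this sketch

void writer() {
    data = 6;
    // Release fence: the write to data may not be reordered past the flag store.
    std::atomic_thread_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);
}

int reader() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Acquire fence: the read of data may not be reordered before the flag load.
    std::atomic_thread_fence(std::memory_order_acquire);
    return data;
}

// Hypothetical helper: runs one reader and one writer, returns what the reader saw.
int run_fence_demo() {
    data = 0;
    ready.store(false);
    int seen = -1;
    std::thread r([&] { seen = reader(); });
    std::thread w(writer);
    w.join();
    r.join();
    assert(seen == 6);  // guaranteed by the fence pairing
    return seen;
}
```

The fences restrict both the compiler and the hardware. On a strongly ordered machine such as x86 the release fence typically needs no extra instruction, but it still stops the compiler from swapping the two stores.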

In conclusion, yes, the stores to data and ready can be reordered on an OoOE processor, but whether they actually are depends on the memory model. So if you need a specific order, indicate it with the appropriate barriers, etc., so that neither the compiler nor the processor chooses a different order for higher performance.
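One common way to provide that indication in C++11 is to put the ordering on the flag itself instead of using a standalone fence. The sketch below assumes the two variables are grouped in a hypothetical Channel struct; produce, consume and run_release_acquire_demo are illustrative names as well.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical wrapper around the question's two shared variables.
struct Channel {
    int data = 0;
    std::atomic<bool> ready{false};
};

void produce(Channel& ch) {
    ch.data = 6;
    // Release store: everything written before it is visible to a thread
    // that observes ready == true with an acquire load.
    ch.ready.store(true, std::memory_order_release);
}

int consume(Channel& ch) {
    // Acquire load: reads after it cannot be moved before it.
    while (!ch.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return ch.data;
}

// Hypothetical helper: one producer, one consumer, returns the consumed value.
int run_release_acquire_demo() {
    Channel ch;
    int seen = -1;
    std::thread r([&] { seen = consume(ch); });
    std::thread w([&] { produce(ch); });
    w.join();
    r.join();
    assert(seen == 6);  // the consumer can never see ready without data
    return seen;
}
```

With a plain bool flag and no ordering, a weakly ordered processor (or the compiler on any target) would be free to make ready visible before data, which is exactly the reordering the question asks about.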
