It makes some sense to call StoreLoad reordering an effect of the store buffer, because the way to prevent it is with mfence or a locked instruction that drains the store buffer before later loads are allowed to read from cache. Merely serializing execution (with lfence) would not be sufficient, because the store buffer still exists. Note that even sfence ; lfence isn't sufficient.
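For example, here's a minimal sketch (my own illustration, not from the question; assuming x86-64 and C++ std::atomic) of one side of the litmus test. For a seq_cst store, compilers typically emit xchg (or mov + mfence), a full barrier that drains the store buffer:

```cpp
#include <atomic>

std::atomic<int> X{0}, Y{0};   // shared flags, both start at 0

int thread_a() {               // thread_b would do the mirror image: store Y, then load X
    // seq_cst store: compilers typically emit xchg (or mov + mfence),
    // draining the store buffer before the later load can read from cache.
    X.store(1, std::memory_order_seq_cst);
    return Y.load(std::memory_order_seq_cst);
}
```

With relaxed (or even release) stores, both operations compile to plain mov, and both threads can read 0: exactly the StoreLoad reordering this litmus test is about.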
Also, I assume the P5 Pentium (in-order dual-issue) had a store buffer, so SMP systems based on it could have this effect, in which case it would definitely be due to the store buffer. I don't know how thoroughly the x86 memory model was documented in the early days before PPro even existed, but any naming of litmus tests done before that might well reflect in-order assumptions. (And naming done after might still include existing in-order systems.)
You can't tell which effect caused StoreLoad reordering. It's possible on a real x86 CPU (with a store buffer) for a later load to execute before the store has even written its address and data to the store buffer.
And yes, executing a store just means writing to the store buffer; it can't commit from the SB to L1d cache and become visible to other cores until after the store retires from the ROB (and thus is known to be non-speculative).
(Retirement happens in-order to support "precise exceptions". Otherwise chaos would ensue: discovering a mis-predict might mean rolling back the state of other cores, i.e. a design that's not sane. Can a speculatively executed CPU branch contain opcodes that access RAM? explains why a store buffer is necessary for OoO exec in general.)
I can't think of any detectable side-effect of the load uop executing before the store-data and/or store-address uops, or before the store retires, rather than after the store retires but before it commits to L1d cache.
You could force the latter case by putting an lfence between the store and the load, so the reordering is definitely caused by the store buffer. (A stronger barrier like mfence, a locked instruction, or a serializing instruction like cpuid will all block the reordering entirely by draining the store buffer before the later load can execute; as an implementation detail, before it can even issue.)
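A sketch of that experiment, assuming GNU-style inline asm on x86-64 (function names are mine; note that lfence is only guaranteed to serialize execution on Intel, and on AMD only when the OS sets the dispatch-serializing MSR bit):

```cpp
#include <atomic>

std::atomic<int> X{0}, Y{0};

int load_after_retire() {
    X.store(1, std::memory_order_relaxed);    // plain mov: just writes the store buffer
    asm volatile("lfence" ::: "memory");      // waits for earlier instructions to complete
                                              // locally; the store has retired but can
                                              // still be sitting in the store buffer
    return Y.load(std::memory_order_relaxed); // can still execute before the store commits
}

int load_blocked() {
    X.store(1, std::memory_order_relaxed);
    asm volatile("mfence" ::: "memory");      // drains the store buffer: blocks the reordering
    return Y.load(std::memory_order_relaxed);
}
```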
Normal out-of-order exec treats all instructions as speculative, only becoming non-speculative when they retire from the ROB, which is done in program order to support precise exceptions. (See Out-of-order execution vs. speculative execution for a more in-depth exploration of that idea, in the context of Intel's Meltdown vulnerability.)
A hypothetical design with OoO exec but no store buffer would be possible. It would perform terribly, with each store having to wait for all previous instructions to be definitively known to be non-faulting and not mis-predicted / mis-speculated before the store can be allowed to execute.
This is not quite the same thing as saying they need to have already executed, though: e.g. just executing the store-address uop of an earlier store is enough to know it's non-faulting, and for a load, the TLB/page-table checks will tell you it's non-faulting even if the data hasn't arrived yet. However, every branch instruction would need to be already executed (and known-correct), as would every ALU instruction like div that can fault.
Such a CPU also doesn't need to stop later loads from running before earlier stores. A speculative load has no architectural effect / visibility, so it's ok if other cores see a share-request for a cache line that was the result of a mis-speculation. (On a memory region whose semantics allow that, such as normal write-back (WB) cacheable memory.) That's why HW prefetching and speculative execution work in normal CPUs.
The memory model even allows StoreLoad reordering, so we're not speculating on memory ordering, only on the store (and other intervening instructions) not faulting. Which again is fine; speculative loads are always fine, it's speculative stores that we must not let other cores see. (So we can't do them at all if we don't have a store buffer or some other mechanism.)
(Fun fact: real x86 CPUs do speculate on memory ordering by doing loads out of order with each other, depending on addresses being ready or not, and on cache hit/miss. This can lead to memory-order mis-speculation "machine clears" aka pipeline nukes (the machine_clears.memory_ordering perf event) if another core wrote to a cache line between when it was actually read and the earliest the memory model says we could have read it. Or even if we guess wrong about whether a load is going to reload something stored recently or not; memory disambiguation when store addresses aren't ready yet involves dynamic prediction, so you can provoke machine_clears.memory_ordering with single-threaded code.)
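For instance, something along these lines (a speculative sketch of my own; whether it actually fires depends on the microarchitecture and its disambiguation predictor) can generate those events in a single thread, by making the CPU guess whether a load aliases an in-flight store:

```cpp
#include <cstdio>

int main() {
    // buf[s] sometimes aliases buf[0]. The store address depends on a
    // just-computed value, so the CPU must *predict* whether the
    // constant-address load of buf[0] can run ahead of the store.
    // Unpredictable aliasing can cause disambiguation machine clears.
    static int buf[2] = {0, 0};
    long sum = 0;
    for (unsigned i = 0; i < 1u << 24; ++i) {
        unsigned s = (i * 0x9E3779B9u) >> 31;  // pseudo-random 0 or 1
        buf[s] = (int)i;                       // store, address ready late
        sum += buf[0];                         // load right behind it
    }
    std::printf("%ld\n", sum);
}
// Observe with: perf stat -e machine_clears.memory_ordering ./a.out
```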
Out-of-order exec in P6 didn't introduce any new kinds of memory reordering, because that could have broken existing multi-threaded binaries. (At that time, mostly just OS kernels, I'd guess!) That's why early loads have to be speculative, if done at all. x86's main reason for existence is backwards compat; back then it wasn't the performance king.
Re: why this litmus test exists at all, if that's what you mean?
Obviously to highlight something that can happen on x86.
Is StoreLoad reordering important? Usually it's not a problem; acquire / release synchronization is sufficient for most inter-thread communication about a buffer being ready to read, or more generally for a lock-free queue, or to implement mutexes. ISO C++ only guarantees that mutex lock / unlock are acquire and release operations, not seq_cst.
It's pretty rare that an algorithm depends on draining the store buffer before a later load.
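A minimal sketch of that common case in standard C++ (my own example): a release store and an acquire load both compile to plain mov on x86, with no store-buffer drain required:

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload;                          // plain data, handed off via the flag
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                    // write the data first
    ready.store(true, std::memory_order_release);    // plain mov on x86: no mfence
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // plain mov load, spins
    assert(payload == 42);            // acquire/release guarantees we see the payload
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```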
Re: "Say I somehow observed this litmus test on an x86 machine":
A fully working program that verifies this reordering is possible in real life on real x86 CPUs: https://preshing.com/20120515/memory-reordering-caught-in-the-act/. (The rest of Preshing's articles on memory ordering are also excellent. Great for getting a conceptual understanding of inter-thread communication via lockless operations.)
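In the same spirit (not Preshing's actual code, which uses semaphores and random delays), here's a condensed C++20 sketch of mine that can catch the reordering, using a std::barrier to line the two threads up each iteration:

```cpp
#include <atomic>
#include <barrier>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

int main() {
    constexpr long iters = 5'000'000;
    std::barrier sync(2);
    long reorders = 0;

    std::thread other([&] {
        for (long i = 0; i < iters; ++i) {
            sync.arrive_and_wait();                  // start the iteration together
            Y.store(1, std::memory_order_relaxed);   // plain mov store
            r2 = X.load(std::memory_order_relaxed);  // may pass the buffered store
            sync.arrive_and_wait();                  // end of iteration
        }
    });

    for (long i = 0; i < iters; ++i) {
        sync.arrive_and_wait();
        X.store(1, std::memory_order_relaxed);
        r1 = Y.load(std::memory_order_relaxed);
        sync.arrive_and_wait();
        if (r1 == 0 && r2 == 0)                      // neither load saw the other's store
            ++reorders;
        X.store(0, std::memory_order_relaxed);       // reset; the barrier orders this
        Y.store(0, std::memory_order_relaxed);       // before the next iteration starts
    }
    other.join();
    std::printf("%ld StoreLoad reorders in %ld iterations\n", reorders, iters);
}
```

Compile with something like g++ -O2 -std=c++20 -pthread; in my understanding this typically reports a nonzero count on real multi-core x86 hardware, though the count varies a lot with timing.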