In-order Atom may be able to do this store-forwarding without stalling at all.
Agner Fog doesn't mention this case specifically for Atom, but unlike all other CPUs, it can store-forward with 1c latency from a store to a wider or differently-aligned load. The only exception Agner found was on cache-line boundaries, where Atom is horrible (16 cycle penalty for a CL-split load or store, even when store-forwarding isn't involved).
> Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1?
There's a terminology issue here. Many people will interpret "Can this load be store-forwarded?" as asking whether it can happen with as low a latency as when all the requirements for fast-path store-forwarding are met, as listed in @IWill's answer (where all the loaded data comes from the most recent store that overlaps any part of the load, and the other relative/absolute alignment rules are met).
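For concreteness, a minimal C sketch of that fast-path case (names are made up; the `volatile` is only there so a compiler keeps the actual store and reload in the asm instead of optimizing them away):

```c
#include <stdint.h>

/* Fast-path store-forwarding: the reload is fully contained in the single
   most recent overlapping store, so its data can come straight from that
   store-buffer entry before the store commits to L1D. */
uint8_t fast_path(uint32_t x)
{
    volatile union { uint32_t u32; uint8_t u8[4]; } buf;
    buf.u32 = x;        /* one dword store (sits in the store buffer)      */
    return buf.u8[0];   /* byte reload contained in that store: fast path  */
}
```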
I thought at first that you were missing the third possibility, of slower but still (nearly?) fixed latency forwarding without waiting for commit to L1D, e.g. with a mechanism that scrapes the whole store buffer (and maybe loads from L1D) in cases that Agner Fog and Intel's optimization manual call "store forwarding failure".
But now I see this wording was intentional, and you really do want to ask whether or not that third possibility (option 2 in the list below) exists.
You might want to edit some of this into your question. In summary, the three likely options for Intel x86 CPUs are (with a sketch of the case where they differ after the list):
1. Intel/Agner definition of store-forwarding success, where all the data comes from only one recent store, with low and (nearly) fixed latency.

2. Extra (but limited) latency to scan the whole store buffer and assemble the correct bytes (according to program order), and (if necessary, or always?) load from L1D to provide data for any bytes that weren't recently stored.

   This is the option we aren't sure exists.

   It also has to wait for all data from store-data uops that don't have their inputs ready yet, since it has to respect program order. There may be some information published about speculative execution with unknown store-address (e.g. guessing that they don't overlap), but I forget.

3. Wait for all overlapping stores to commit to L1D, then load from L1D.

   Some real x86 CPUs might fall back to this in some cases, but they might always use option 2 without introducing a StoreLoad barrier. (Remember that x86 stores have to commit in program order, and loads have to happen in program order. This would effectively drain the store buffer to this point, like `mfence`, although later loads to other addresses could still speculatively store-forward or just take data from L1D.)
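Here is a minimal C sketch of the case the options differ on (made-up names): a wide reload whose bytes come from two separate recent stores, so no single store-buffer entry can supply everything:

```c
#include <stdint.h>

/* No single store contains all the bytes the load needs, so option 1
   (fast-path forwarding) can't apply.  Option 2 would assemble the result
   from both store-buffer entries (plus L1D if some bytes weren't stored
   recently); option 3 would stall until both stores commit to L1D. */
uint16_t load_spanning_two_stores(uint8_t a, uint8_t b)
{
    volatile union { uint16_t u16; uint8_t u8[2]; } buf;
    buf.u8[0] = a;    /* byte store                           */
    buf.u8[1] = b;    /* byte store                           */
    return buf.u16;   /* word reload overlapping both stores  */
}
```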
Evidence for the middle option:
The locking scheme proposed in *Can x86 reorder a narrow store with a wider load that fully contains it?* would work if store-forwarding failure required a flush to L1D. Since it doesn't work on real hardware without `mfence`, that's strong evidence that real x86 CPUs are merging data from the store buffer with data from L1D. So option 2 exists and is used in this case.
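Roughly, the broken idea from that question looks like this sketch (hypothetical names, little-endian x86 assumed; shown only to illustrate the reordering, not as usable synchronization):

```c
#include <stdint.h>

/* Sketch of the (broken) locking idea from that question.  Each thread
   owns one byte of a 2-byte lock word.  The hope was that a narrow store
   can't reorder with a wider load that fully contains it, so at most one
   thread could ever see the other thread's byte as 0. */
static volatile union {
    uint16_t word;
    uint8_t  bytes[2];
} lock;

int try_lock(int me)                  /* me = 0 or 1 */
{
    lock.bytes[me] = 1;               /* narrow store of my flag           */
    /* Without an mfence (or a locked RMW) here, the wide load below can
       take my own byte from the store buffer and the other byte from L1D
       before my store is globally visible, so both threads can succeed. */
    uint16_t seen = lock.word;        /* wider load containing that store  */
    return seen == (uint16_t)(1u << (8 * me));   /* other byte still 0?    */
}
```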
See also Linus Torvalds' explanation that x86 really does allow this kind of reordering, in response to someone else who proposed the same locking idea as that SO question.
I haven't tested whether store-forwarding failure/stall penalties are variable, but if they aren't, that strongly implies the CPU falls back to scanning the whole store buffer when the best-case forwarding doesn't work.
Hopefully someone will answer *What are the costs of failed store-to-load forwarding on x86?*, which asks exactly that. I will if I get around to it.
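One way to test it would be a loop-carried store/reload chain timed with the TSC, varying how the reload overlaps the store(s); a rough sketch (assuming GCC/Clang on x86, where `__rdtsc` comes from `x86intrin.h`):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

/* Time a loop-carried store->reload chain.  Changing which stores the
   reload overlaps (one store, two stores, store + L1D bytes, ...) and
   comparing cycles per iteration would show whether the failed-forwarding
   penalty is (nearly) fixed or varies with the case. */
uint64_t cycles_per_store_reload(int iters)
{
    volatile union { uint64_t q; uint32_t d[2]; } buf;
    uint64_t x = 1;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++) {
        buf.d[0] = (uint32_t)x;          /* 4-byte store                  */
        buf.d[1] = (uint32_t)(x >> 32);  /* 4-byte store                  */
        x += buf.q;                      /* 8-byte reload spanning both   */
    }
    return (__rdtsc() - t0) / (uint64_t)(iters > 0 ? iters : 1);
}
```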
Agner Fog only ever mentions a single number for store-forwarding penalties, and doesn't say it's bigger if cache-miss stores are in flight ahead of the stores that failed to forward. (This would cause a big delay, because stores have to commit to L1D in order because of x86's strongly-ordered memory model.) He also doesn't say anything about it being different in cases where the data comes from one store + L1D vs. from parts of two or more stores, so I'd guess that it works in this case, too.
I suspect that "failed" store-forwarding is common enough that it's worth the transistors to handle it faster than just flushing the store queue and reloading from L1D.
For example, gcc doesn't specifically try to avoid store-forwarding stalls, and some of its idioms cause them (e.g. `__m128i v = _mm_set_epi64x(a, b);` in 32-bit code stores/reloads to the stack, which is already the wrong strategy on most CPUs for most cases, hence that bug report). It's not good, but the results aren't usually catastrophic, AFAIK.
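For reference, a compilable version of that idiom (in 32-bit code the two 64-bit halves arrive in integer register pairs, and gcc has, at least historically, bounced them through the stack and reloaded them with one 16-byte vector load, i.e. exactly the store-forwarding-stall pattern discussed above):

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_set_epi64x */

/* In 32-bit code, building the vector from two 64-bit integer halves has
   (in older gcc at least) gone through narrow stack stores followed by a
   16-byte vector reload: the store-forwarding-stall pattern. */
__m128i make_vec(int64_t a, int64_t b)
{
    return _mm_set_epi64x(a, b);
}
```

In 64-bit code the halves are usually moved straight between integer and vector registers, so the stall doesn't apply there.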