In-order Atom may be able to do this store-forwarding without stalling at all.
Agner Fog doesn't mention this case specifically for Atom, but unlike all other CPUs, it can store-forward with 1c latency from a store to a wider or differently-aligned load. The only exception Agner found was on cache-line boundaries, where Atom is horrible (16 cycle penalty for a CL-split load or store, even when store-forwarding isn't involved).
> Can this load be store-forwarded, or does it need to wait until both prior stores commit to L1?
There's a terminology issue here. Many people will interpret "Can this load be store-forwarded?" as asking whether it can happen with as low a latency as when all the requirements for fast-path store-forwarding are met, as listed in @IWill's answer (where all the loaded data comes from the most recent store that overlaps any part of the load, and the other relative/absolute alignment rules are met).
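For concreteness, a minimal C sketch of that fast-path case (names are made up; the `volatile` is only there so a compiler keeps the actual store and reload in the asm instead of optimizing them away):

```c
#include <stdint.h>

/* Fast-path store-forwarding: the reload is fully contained in the single
   most recent overlapping store, so its data can come straight from that
   store-buffer entry before the store commits to L1D. */
uint8_t fast_path(uint32_t x)
{
    volatile union { uint32_t u32; uint8_t u8[4]; } buf;
    buf.u32 = x;        /* one dword store (sits in the store buffer)      */
    return buf.u8[0];   /* byte reload contained in that store: fast path  */
}
```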
I thought at first that you were missing the third possibility, of slower but still (nearly?) fixed latency forwarding without waiting for commit to L1D, e.g. with a mechanism that scrapes the whole store buffer (and maybe loads from L1D) in cases that Agner Fog and Intel's optimization manual call "store forwarding failure".
But now I see this wording was intentional, and you really do want to ask whether or not that third possibility (option 2 in the list below) exists.
You might want to edit some of this into your question. In summary, the three likely options for Intel x86 CPUs are (with a sketch of the case where they differ after the list):
1. Intel/Agner definition of store-forwarding success, where all the data comes from only one recent store, with low and (nearly) fixed latency.

2. Extra (but limited) latency to scan the whole store buffer and assemble the correct bytes (according to program order), and (if necessary, or always?) load from L1D to provide data for any bytes that weren't recently stored.

   This is the option we aren't sure exists.

   It also has to wait for all data from store-data uops that don't have their inputs ready yet, since it has to respect program order. There may be some information published about speculative execution with unknown store-address (e.g. guessing that they don't overlap), but I forget.

3. Wait for all overlapping stores to commit to L1D, then load from L1D.

   Some real x86 CPUs might fall back to this in some cases, but they might always use option 2 without introducing a StoreLoad barrier. (Remember that x86 stores have to commit in program order, and loads have to happen in program order. This would effectively drain the store buffer to this point, like `mfence`, although later loads to other addresses could still speculatively store-forward or just take data from L1D.)
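Here is a minimal C sketch of the case the options differ on (made-up names): a wide reload whose bytes come from two separate recent stores, so no single store-buffer entry can supply everything:

```c
#include <stdint.h>

/* No single store contains all the bytes the load needs, so option 1
   (fast-path forwarding) can't apply.  Option 2 would assemble the result
   from both store-buffer entries (plus L1D if some bytes weren't stored
   recently); option 3 would stall until both stores commit to L1D. */
uint16_t load_spanning_two_stores(uint8_t a, uint8_t b)
{
    volatile union { uint16_t u16; uint8_t u8[2]; } buf;
    buf.u8[0] = a;    /* byte store                           */
    buf.u8[1] = b;    /* byte store                           */
    return buf.u16;   /* word reload overlapping both stores  */
}
```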
Evidence for the middle option:
The locking scheme proposed in *Can x86 reorder a narrow store with a wider load that fully contains it?* would work if store-forwarding failure required a flush to L1D. Since it doesn't work on real hardware without `mfence`, that's strong evidence that real x86 CPUs are merging data from the store buffer with data from L1D. So option 2 exists and is used in this case.
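Roughly, the broken idea from that question looks like this sketch (hypothetical names, little-endian x86 assumed; shown only to illustrate the reordering, not as usable synchronization):

```c
#include <stdint.h>

/* Sketch of the (broken) locking idea from that question.  Each thread
   owns one byte of a 2-byte lock word.  The hope was that a narrow store
   can't reorder with a wider load that fully contains it, so at most one
   thread could ever see the other thread's byte as 0. */
static volatile union {
    uint16_t word;
    uint8_t  bytes[2];
} lock;

int try_lock(int me)                  /* me = 0 or 1 */
{
    lock.bytes[me] = 1;               /* narrow store of my flag           */
    /* Without an mfence (or a locked RMW) here, the wide load below can
       take my own byte from the store buffer and the other byte from L1D
       before my store is globally visible, so both threads can succeed. */
    uint16_t seen = lock.word;        /* wider load containing that store  */
    return seen == (uint16_t)(1u << (8 * me));   /* other byte still 0?    */
}
```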
See also Linus Torvalds' explanation that x86 really does allow this kind of reordering, in response to someone else who proposed the same locking idea as that SO question.
I haven't tested whether store-forwarding failure/stall penalties are variable, but if they aren't, that strongly implies the CPU falls back to scanning the whole store buffer when the best-case forwarding doesn't work.
Hopefully someone will answer *What are the costs of failed store-to-load forwarding on x86?*, which asks exactly that. I will if I get around to it.
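One way to test it would be a loop-carried store/reload chain timed with the TSC, varying how the reload overlaps the store(s); a rough sketch (assuming GCC/Clang on x86, where `__rdtsc` comes from `x86intrin.h`):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang */

/* Time a loop-carried store->reload chain.  Changing which stores the
   reload overlaps (one store, two stores, store + L1D bytes, ...) and
   comparing cycles per iteration would show whether the failed-forwarding
   penalty is (nearly) fixed or varies with the case. */
uint64_t cycles_per_store_reload(int iters)
{
    volatile union { uint64_t q; uint32_t d[2]; } buf;
    uint64_t x = 1;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < iters; i++) {
        buf.d[0] = (uint32_t)x;          /* 4-byte store                  */
        buf.d[1] = (uint32_t)(x >> 32);  /* 4-byte store                  */
        x += buf.q;                      /* 8-byte reload spanning both   */
    }
    return (__rdtsc() - t0) / (uint64_t)(iters > 0 ? iters : 1);
}
```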
Agner Fog only ever mentions a single number for store-forwarding penalties, and doesn't say it's bigger if cache-miss stores are in flight ahead of the stores that failed to forward. (This would cause a big delay, because stores have to commit to L1D in order because of x86's strongly-ordered memory model.) He also doesn't say anything about it being different in cases where the data comes from one store + L1D vs. from parts of two or more stores, so I'd guess that it works in this case, too.
I suspect that "failed" store-forwarding is common enough that it's worth the transistors to handle it faster than just flushing the store queue and reloading from L1D.
For example, gcc doesn't specifically try to avoid store-forwarding stalls, and some of its idioms cause them (e.g. `__m128i v = _mm_set_epi64x(a, b);` in 32-bit code stores/reloads to the stack, which is already the wrong strategy on most CPUs for most cases, hence that bug report). It's not good, but the results aren't usually catastrophic, AFAIK.
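For reference, a compilable version of that idiom (in 32-bit code the two 64-bit halves arrive in integer register pairs, and gcc has, at least historically, bounced them through the stack and reloaded them with one 16-byte vector load, i.e. exactly the store-forwarding-stall pattern discussed above):

```c
#include <stdint.h>
#include <emmintrin.h>   /* SSE2: __m128i, _mm_set_epi64x */

/* In 32-bit code, building the vector from two 64-bit integer halves has
   (in older gcc at least) gone through narrow stack stores followed by a
   16-byte vector reload: the store-forwarding-stall pattern. */
__m128i make_vec(int64_t a, int64_t b)
{
    return _mm_set_epi64x(a, b);
}
```

In 64-bit code the halves are usually moved straight between integer and vector registers, so the stall doesn't apply there.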