2 votes

I'm trying to understand how the "fetch" phase of the CPU pipeline interacts with memory.

Let's say I have these instructions:

4:  bb 01 00 00 00          mov    $1,%ebx
9:  bb 02 00 00 00          mov    $2,%ebx
e:  b3 03                   mov    $3,%bl

What happens if CPU1 writes 00 48 c7 c3 04 00 00 00 to memory address 8 (i.e. 64-bit aligned) while CPU2 is executing these same instructions? The instruction stream would atomically change from 2 instructions to 1 like this:

4:  bb 01 00 00 00          mov    $1,%ebx
9:  48 c7 c3 04 00 00 00    mov    $4,%rbx
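For concreteness, CPU1's store could look roughly like this (a sketch using GCC/Clang atomic builtins; code is a hypothetical pointer to a writable+executable mapping holding the bytes above, 8-byte aligned so that code + 8 is the 64-bit-aligned address 8):

    #include <stdint.h>
    #include <string.h>

    void patch(uint8_t *code) {
        /* Replacement bytes for addresses 8..f: byte 8 keeps its old 00
           (the last byte of mov $1), bytes 9..f become the 7-byte
           mov $4,%rbx. */
        static const uint8_t bytes[8] =
            {0x00, 0x48, 0xc7, 0xc3, 0x04, 0x00, 0x00, 0x00};
        uint64_t v;
        memcpy(&v, bytes, sizeof v);  /* assemble the qword to store */
        /* One aligned 8-byte store: other cores should observe either all
           old bytes or all new bytes, never a mix. */
        __atomic_store_n((uint64_t *)(code + 8), v, __ATOMIC_RELEASE);
    }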

Since CPU1 is writing to the same memory that CPU2 is reading from, there's contention. Would the write cause the CPU2 pipeline to stall while it refreshes its L1 cache? And suppose CPU2 has just completed the "fetch" phase for mov $2: would that fetch be discarded in order to re-fetch the updated memory?

Additionally there's the issue of atomicity when changing 2 instructions into 1.

I found this quite old document that mentions "The instruction fetch unit fetches one 32-byte cache line in each clock cycle from the instruction cache memory", which I think can be interpreted to mean that each instruction gets a fresh copy of the cache line from L1, even when consecutive instructions share the same cache line. But I don't know if/how this applies to modern CPUs.

If the above is correct, that would mean after fetching mov $2 into the pipeline, it's possible the next fetch would get the updated value at address e and try to execute 00 00 (add %al,(%rax)) which would probably fail.

But if the fetch of mov $2 brings mov $3 into an "instruction cache", would it make sense to think that the next fetch would just get the instruction from that cache (and return mov $3) without re-querying L1? This would effectively make the fetch of these 2 instructions atomic, as long as they share a cache line.

So which is it? Basically there are too many unknowns and too much I can only speculate about, so I'd really appreciate a clock-cycle-by-clock-cycle breakdown of how two fetch phases of the pipeline interact with (changes in) the memory they access.

This is all implementation-dependent. Different processors deal with the situation differently. – Raymond Chen
For a core modifying its own code, see: Observing stale instruction fetching on x86 with self-modifying code - that's different (and harder) because out-of-order exec of the store has to be sorted out from code-fetch of earlier vs. later instructions in program order. i.e. the moment at which the store must become visible is fixed, unlike with another core where it just happens when it happens. – Peter Cordes

2 Answers

3 votes

As Chris said, an RFO (Read For Ownership) can invalidate an I-cache line at any time.

Depending on how superscalar fetch-groups line up, the cache line can be invalidated after fetching the 5-byte mov at 9: but before fetching the next instruction at e:.

When fetch eventually happens (this core gets a shared copy of the cache line again), RIP = e and it will fetch the last 2 bytes of the mov $4,%rbx. Cross-modifying code needs to make sure that no other core is executing in the middle of where it wants to write one long instruction.

In this case, you'd get 00 00 add %al, (%rax).

Also note that the writing CPU needs to make sure the modification is atomic, e.g. with an 8-byte store (Intel P6 and later CPUs guarantee that stores up to 8 bytes at any alignment within 1 cache line are atomic; AMD doesn't), or lock cmpxchg or lock cmpxchg16b. Otherwise it's possible for a reader to see partially updated instructions. You can consider instruction-fetch to be doing atomic 16-byte loads or something like that.
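If the writer wants to rely on that Intel guarantee, making the containment requirement explicit is cheap. A minimal sketch, assuming a 64-byte cache line (true of current Intel and AMD parts):

    #include <stdbool.h>
    #include <stdint.h>

    /* True if an 8-byte store at addr stays inside one 64-byte cache line:
       the condition under which Intel P6 and later guarantee the store is
       atomic. (A naturally aligned address, addr % 8 == 0, always passes.) */
    static bool store8_within_one_line(uintptr_t addr) {
        return (addr & 63) <= 64 - 8;
    }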


"The instruction fetch unit fetches one 32-byte cache line in each clock cycle from the instruction cache memory" which I think can be interpreted to mean that each instruction gets a fresh copy of the cache line from L1,

No.

That wide fetch block is then decoded into multiple x86 instructions! The point of wide fetch is to pull in multiple instructions at once, not to redo it separately for each instruction. That document seems to be about P6 (Pentium III), although P6 actually fetches only 16 bytes at a time into a 32-byte buffer, from which the decoders take a 16-byte window.

P6 is 3-wide superscalar, and every clock cycle can decode up to 16 bytes of machine code containing up to 3 instructions. (But there's a pre-decode stage to find instruction lengths first...)

See Agner Fog's microarch guide (https://agner.org/optimize/) for details, with a focus on the details that are relevant for tuning software performance. Later microarchitectures add queues between pre-decode and decode; see those sections of the guide, and https://realworldtech.com/merom/ (Core 2).

And of course see https://realworldtech.com/sandy-bridge for more modern x86 with a uop cache. Also https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Core for recent AMD.

For good background before reading any of those, see Modern Microprocessors: A 90-Minute Guide!



3 votes

It varies between implementations, but generally this is managed by the multiprocessor's cache coherency protocol. In the simplest terms, when CPU1 writes to a memory location, that location is invalidated in every other cache in the system. So the write invalidates the line in CPU2's instruction cache, as well as any (partially) decoded instructions in CPU2's uop cache (if it has such a thing). When CPU2 goes to fetch/execute the next instruction, all those caches miss and it stalls while things are refetched. Depending on the cache coherency protocol, that may involve waiting for the write to reach memory, fetching the modified data directly from CPU1's dcache, or going through some shared cache.
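To make that stall end at a well-defined point, cross-modifying code is normally paired with a software handshake, roughly the algorithm Intel's SDM describes for cross-modifying code. A sketch, where patch_code and run_patched_code are hypothetical helpers:

    #include <stdatomic.h>

    extern void patch_code(void);        /* hypothetical: the writer's stores */
    extern void run_patched_code(void);  /* hypothetical: jump to the new code */

    static atomic_int code_ready;

    void writer_cpu(void) {              /* e.g. CPU1 */
        patch_code();
        atomic_store_explicit(&code_ready, 1, memory_order_release);
    }

    void executing_cpu(void) {           /* e.g. CPU2, waiting *outside* the
                                            region being patched */
        while (!atomic_load_explicit(&code_ready, memory_order_acquire))
            ;
        /* cpuid is a serializing instruction: it discards any stale
           prefetched/decoded instructions before we enter the new code. */
        unsigned a = 0, b, c, d;
        __asm__ volatile("cpuid"
                         : "+a"(a), "=b"(b), "=c"(c), "=d"(d) :: "memory");
        run_patched_code();
    }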