I'm trying to understand how the "fetch" phase of the CPU pipeline interacts with memory.
Let's say I have these instructions:
4: bb 01 00 00 00 mov $1,%ebx
9: bb 02 00 00 00 mov $2,%ebx
e: b3 03 mov $3,%bl
What happens if CPU1 writes the 8 bytes 00 48 c7 c3 04 00 00 00 to memory address 8 (i.e. 64-bit aligned) while CPU2 is executing these same instructions? The write would atomically turn the last two instructions into one, like this:
4: bb 01 00 00 00 mov $1,%ebx
9: 48 c7 c3 04 00 00 00 mov $4,%rbx
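To double-check the byte arithmetic, here is a minimal Python sketch that overlays the 8-byte store on the original bytes. It only models the memory image, not any CPU behaviour, and the contents of addresses 0..3 are not given above, so the 0x90 filler there is purely an assumption.

```python
# Byte image of addresses 0x0..0xf. Addresses 0..3 are unknown,
# so they are filled with a placeholder (0x90); this is an assumption.
PAD = bytes([0x90] * 4)
before = bytearray(PAD + bytes.fromhex("bb01000000" "bb02000000" "b303"))

# The 8 bytes that CPU1 stores at address 8.
patch = bytes.fromhex("0048c7c304000000")

after = bytearray(before)
after[8:16] = patch  # models one aligned 8-byte store at address 8


def hexdump(buf, start, end):
    """Return the bytes in buf[start:end] as space-separated hex."""
    return " ".join(f"{b:02x}" for b in buf[start:end])


print("before 4:", hexdump(before, 4, 9))
print("before 9:", hexdump(before, 9, 14))
print("before e:", hexdump(before, 14, 16))
print("after  4:", hexdump(after, 4, 9))
print("after  9:", hexdump(after, 9, 16))
```

The byte at address 8 is 00 both before and after the store, which is why the first instruction is unaffected even though the store overlaps its last byte.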
Since CPU1 is writing to the same memory that CPU2 is reading from, there's contention.
Would the write cause the CPU2 pipeline to stall while it refreshes its L1 cache?
Let's say that CPU2 has just completed the "fetch" phase for mov $2: would that be discarded in order to re-fetch the updated memory?
Additionally there's the issue of atomicity when changing 2 instructions into 1.
I found this quite old document that mentions "The instruction fetch unit fetches one 32-byte cache line in each clock cycle from the instruction cache memory" which I think can be interpreted to mean that each instruction gets a fresh copy of the cache line from L1, even if they share the same cache line. But I don't know if/how this applies to modern CPUs.
If the above is correct, that would mean that after fetching mov $2 into the pipeline, it's possible the next fetch would get the updated value at address e and try to execute 00 00 (add %al,(%rax)), which would probably fail.
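The scenario I'm worried about can be sketched as a toy model in Python. It is not how a real front end works: it assumes each instruction is fetched independently, and the `insn_len` helper is hypothetical and only knows the four encodings that appear in this question. The 0x90 filler for addresses 0..3 is also an assumption.

```python
# Memory image before and after CPU1's 8-byte store at address 8.
# Addresses 0..3 are unknown; 0x90 is an assumed placeholder.
OLD = bytes.fromhex("90909090" "bb01000000" "bb02000000" "b303")
NEW = OLD[:8] + bytes.fromhex("0048c7c304000000")


def insn_len(mem, pc):
    """Hypothetical length decoder covering only this question's encodings."""
    op = mem[pc]
    if op == 0xBB:
        return 5  # mov $imm32,%ebx
    if op == 0xB3:
        return 2  # mov $imm8,%bl
    if op == 0x48 and mem[pc + 1] == 0xC7:
        return 7  # REX.W mov $imm32,%rbx
    if op == 0x00 and mem[pc + 1] == 0x00:
        return 2  # add %al,(%rax)
    raise ValueError(f"unknown encoding at {pc:#x}")


def fetch(mem, pc):
    """Return (instruction bytes, next pc) as seen in the given image."""
    n = insn_len(mem, pc)
    return mem[pc:pc + n], pc + n


# Fetch at address 9 sees the old bytes, so the next pc is 0xe.
first, pc = fetch(OLD, 0x9)
# The store lands in between; the fetch at 0xe sees the new bytes.
second, pc = fetch(NEW, pc)

print(first.hex(" "))   # old mov $2,%ebx
print(second.hex(" "))  # the 00 00 tail of the new mov $4,%rbx
```

If both fetches see the same image (both old or both new), the decoded stream is consistent; only the mixed case lands in the middle of the new instruction.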
But if the fetch of mov $2 brings mov $3 into an "instruction cache", would it make sense to think that the next fetch would just get the instruction from that cache (and return mov $3) without re-querying L1?
This would effectively make the fetch of these 2 instructions atomic, as long as they share a cache line.
So which is it? Basically there are too many unknowns and too much I can only speculate about, so I'd really appreciate a cycle-by-cycle breakdown of how two fetch phases of the pipeline interact with (changes in) the memory they access.