1
votes

The Intel Software Developer's Manual mentions that "instruction fetch and page table accesses can pass locked instructions". What does this mean, and why does it matter?

There's a post explaining that many Windows functions begin with a MOV EDI, EDI instruction because it is useful for safe code hooking: it can be atomically replaced with a two-byte relative jump. But if instruction fetch accesses to memory can "pass locked instructions", is it possible for the following to happen?

  • cpu 0 atomically replaces a MOV EDI, EDI instruction with a relative jump
  • cpu 1 "passes the locked instruction", fetching and executing the stale MOV EDI, EDI

Would it also be possible for something like this to happen?

  • cpu 0 atomically replaces a MOV EDI, EDI instruction with a relative jump
  • because instruction fetches can "pass locked instructions", the replacement of the instruction can be considered non-atomic from the perspective of instruction fetches, so cpu 1 fetches one byte from the stale instruction and one byte from the new instruction
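
To make the patch step in these scenarios concrete, here is a minimal sketch of the usual hot-patch sequence, assuming the conventional layout (5 padding bytes immediately before the function, MOV EDI, EDI as its first instruction, and a 2-byte-aligned entry so the 16-bit store is atomic); the function install_hook and all of its details are illustrative, not taken from the linked post:

    #include <windows.h>
    #include <stdint.h>
    #include <string.h>

    /* Illustrative only: patches 'func' (which must start with 8B FF,
     * i.e. MOV EDI, EDI) to jump to 'hook'. Assumes 'func' is at least
     * 2-byte aligned so the 16-bit store below is a single atomic write. */
    void install_hook(uint8_t *func, void *hook)
    {
        uint8_t *pad = func - 5;       /* 5-byte padding before the function */
        DWORD old;

        VirtualProtect(pad, 7, PAGE_EXECUTE_READWRITE, &old);

        /* Step 1: write a long jump into the padding. Nothing executes these
         * bytes yet, because the function still begins with MOV EDI, EDI.   */
        int32_t rel32 = (int32_t)((uint8_t *)hook - (pad + 5));
        pad[0] = 0xE9;                 /* JMP rel32                           */
        memcpy(pad + 1, &rel32, sizeof rel32);

        /* Step 2: atomically replace MOV EDI, EDI (8B FF) with EB F9, a short
         * jump back by 7 bytes onto the long jump. The single aligned 16-bit
         * store publishes both bytes at once.                                */
        InterlockedExchange16((volatile SHORT *)func, (SHORT)0xF9EB);

        FlushInstructionCache(GetCurrentProcess(), pad, 7);
        VirtualProtect(pad, 7, old, &old);
    }

Whether an in-flight instruction fetch on another core can still observe the old bytes, or a mix of old and new, is exactly what the two scenarios above are asking about.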

From Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3: "System Programming Guide"

Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.

For the P6 family processors, locked operations serialize all outstanding load and store operations (that is, wait for them to complete). This rule is also true for the Pentium 4 and Intel Xeon processors, with one exception. Load operations that reference weakly ordered memory types (such as the WC memory type) may not be serialized.

Link: Why do Windows functions all begin with a pointless MOV EDI, EDI instruction?

1
I'm not sure how the different statements in your question fit together. Here's what I think matters: 1. You want to be able to insert a two-byte jump instruction. 2. The place where you insert that instruction also has to be a two-byte instruction; it cannot, say, be two consecutive one-byte NOPs, because a core might have executed the first one-byte instruction just before you overwrite the pair and would then fetch a torn second half. Locked memory access doesn't seem to have anything to do with this. – Kerrek SB
It's quite clear why instruction fetches and page walks were excluded from having to serialize: it would make them too slow and force the CPU to add expensive hardware to compare their addresses against buffered stores. On the other hand, they're not critical to protect in most multithreading scenarios. – Leeor
Thanks for the comment, Kerrek SB. I updated the question in an attempt to make it clearer what I'm wondering about. – Jason

1 Answer

1
votes

Regarding the second scenario: "passing a locked instruction" doesn't mean it breaks atomicity. If the store writes these 2 instruction bytes atomically, you can't see only one of them at any point (the store simply operates on the full cache line; note that it won't be atomic if the 2 bytes are split over 2 lines). What it does mean is that any locked instruction you put in to try to synchronize would not block the code fetch, so in terms of memory ordering it can occur before or after it.
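
As a small illustration of that caveat, here is a hedged sketch of a check for whether both patched bytes share a cache line; the 64-byte line size and the helper name are assumptions, not something stated in the post:

    #include <stdint.h>
    #include <stdbool.h>

    #define ASSUMED_LINE_SIZE 64u   /* typical x86 cache line; an assumption */

    /* Returns true if both bytes of a 2-byte patch at 'entry' fall inside the
     * same (assumed 64-byte) cache line, i.e. the case treated as atomic
     * above; a pair that straddles two lines would not be.                  */
    static bool two_byte_patch_in_one_line(const void *entry)
    {
        uintptr_t first = (uintptr_t)entry;
        uintptr_t last  = first + 1;
        return (first / ASSUMED_LINE_SIZE) == (last / ASSUMED_LINE_SIZE);
    }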

Now, regarding the first scenario and the question in general: note that there's no lock in your description. The case you describe is perfectly valid even if it were a data read instead of a code read; there's no inherent order between the two cores other than what you enforce yourself. In order to enforce such an order, you could start working with barriers and semaphores, or any other method, and it would eventually boil down to some lock blocking cpu 1 until cpu 0 signals that the write is done.
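
A minimal sketch of that kind of enforced order, using C11 atomics as a stand-in for whatever barrier or lock you choose (the names patch_done and patched_entry are illustrative, not from the original post):

    #include <stdatomic.h>
    #include <stdint.h>

    static _Atomic int      patch_done = 0;
    static _Atomic uint16_t patched_entry;  /* stands in for the 2 patched bytes */

    /* cpu 0: publish the new bytes, then signal. */
    void cpu0_patch(void)
    {
        atomic_store_explicit(&patched_entry, 0xF9EB, memory_order_relaxed);
        /* Release: everything stored above becomes visible to a thread that
         * observes patch_done == 1 with an acquire load.                     */
        atomic_store_explicit(&patch_done, 1, memory_order_release);
    }

    /* cpu 1: blocked (spinning) until cpu 0 signals, then read the data. */
    uint16_t cpu1_wait_and_read(void)
    {
        while (!atomic_load_explicit(&patch_done, memory_order_acquire))
            ;
        /* A data read here is guaranteed to see the new bytes; an instruction
         * fetch gets no such guarantee from this ordering alone, which is
         * where the SMC flush described below comes in.                      */
        return atomic_load_explicit(&patched_entry, memory_order_relaxed);
    }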

In that case, a data read would have been stalled by the lock, but a younger code read could actually fetch the old data in spite of your attempt to protect it. However, here comes a mechanism x86 cores usually implement, called an SMC (self-modifying code) flush: the store from cpu 0 snoops the instruction cache in cpu 1, detects the stale code there, and since it can't tell where exactly this code is along the pipe, or what effects it may have incurred already (for all we know there could be a halt instruction there, or worse), it simply flushes the entire pipeline. The exact details may differ between products, but the concept is very old.

The page walk case is a little more complicated, but there's also a mechanism here that detects most cases of a modification during use - look up "TLB shootdown". Note that in some cases both SMC and TLB modification at runtime are perfectly valid and serve a purpose (SMC is very often used for JITting, and page moves are a cheap way to pass data between processes without having to copy it).