Question
What are some ideas for cross-modifying code that could trigger unexpected behavior on x86 or x86-64 systems, where the cross-modifying code does everything correctly except execute a serializing instruction on the executing processor before executing the modified code?
As noted below, I have a Core 2 Duo E6600 processor to test on, which is explicitly documented as prone to this issue. I will test any ideas shared with me on this machine and post updates.
Background
On x86 and x64 systems, the official guidance for writing cross-modifying code is to do the following:
; Action of Modifying Processor
Store modified code (as data) into code segment;
Memory_Flag ← 1;
; Action of Executing Processor
WHILE (Memory_Flag ≠ 1)
    Wait for code to update;
ELIHW;
Execute serializing instruction; (* For example, CPUID instruction *)
Begin executing modified code;
The serializing instruction is explicitly mentioned as necessary in the errata for some processors. For example, the Intel Core 2 Duo E6000 series has the following erratum (from http://www.mathemainzel.info/files/intelX6800andintelE6000.pdf):
The act of one processor, or system bus master, writing data into a currently executing code segment of a second processor with the intent of having the second processor execute that data as code is called cross-modifying code (XMC). XMC that does not force the second processor to execute a synchronizing instruction, prior to execution of the new code, is called unsynchronized XMC.
Software using unsynchronized XMC to modify the instruction byte stream of a processor can see unexpected or unpredictable execution behavior from the processor that is executing the modified code.
There is some speculation at http://linux.kernel.narkive.com/FDc9TB0d/patch-linux-kernel-markers as to why unexpected execution behavior could occur if a serializing instruction is not used:
When the i-fetch has been done and the micro-ops are in the trace cache then there's no longer a direct correlation between the original machine instruction boundaries and the micro ops. This is due to optimization. For example (artificial one for illustrative purposes):
mov eax,ebx
mov memory,eax
mov eax,1
(using intel notation not ATT - force of habit)
In the trace cache there would be no micro ops to update eax with ebx.
Altering the "mov eax,ebx" to "mov ecx,ebx" on the fly invalidates the optimized trace cache, hence the only recourse is a GPF. If the modification doesn't invalidate the trace cache then no GPF. The question is: "can we predict the circumstances when the trace cache has not been invalidated", and the answer in general is no since the microarchitecture is not public. But one can guess that modifying the single byte opcode with an interrupting instruction - int3 - doesn't cause an inconsistency that can't be handled. And that's what Intel confirmed. Go ahead and store int3 without the need to synchronise (i.e. force the trace cache to be flushed).
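To make the int3 case in the quote concrete, here is a small C sketch of the byte-level patch it describes: overwrite the first opcode byte of an instruction with 0xCC (int3) in a single one-byte store, and keep the original byte so it can be restored. The helper names are hypothetical, `__atomic_store_n` is a GCC/Clang builtin, and the sketch only manipulates the bytes. It does not execute them or install a trap handler.

```c
#include <assert.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC   /* single-byte breakpoint instruction */

/* Replace the first opcode byte with int3; return the byte that was there. */
uint8_t patch_first_byte_with_int3(uint8_t *insn) {
    uint8_t original = insn[0];
    /* One-byte store, so no other processor can observe a torn write. */
    __atomic_store_n(&insn[0], (uint8_t)INT3_OPCODE, __ATOMIC_RELEASE);
    return original;
}

/* Put the saved opcode byte back. */
void restore_first_byte(uint8_t *insn, uint8_t original) {
    __atomic_store_n(&insn[0], original, __ATOMIC_RELEASE);
}
```

For example, applied to the bytes `89 D8` (the encoding of `mov eax,ebx`), the patch leaves `CC D8` in memory until the original byte is restored.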
There's also a bit more information at https://sourceware.org/ml/systemtap/2005-q3/msg00208.html:
When we became aware of this I had a long discussion with Intel's microarchitecture guys. It turns out that the reason for this erratum (which incidentally Intel does not intend to fix) is because the trace cache - the stream of micro-ops resulting from instruction interpretation - cannot be guaranteed to be valid. Reading between the lines I assume this issue arises because of optimization done in the trace cache, where it is no longer possible to identify the original instruction boundaries. If the CPU discovers that the trace cache has been invalidated because of unsynchronized cross-modification then instruction execution will be aborted with a GPF. Further discussion with Intel revealed that replacing the first opcode byte with an int3 would not be subject to this erratum.
Beyond what I've posted here, I haven't seen much on the internet regarding this issue. Additionally, I haven't found any public examples of people getting bitten by failing to execute the serializing instruction when using cross-modifying code on x86 and x86-64 systems.
I have a computer running an Intel Core 2 Duo E6600 Processor, which is explicitly documented as being prone to this problem, and I have not been able to write code that triggers this issue.
Writing code to do this is a personal curiosity for me. In production code, I'd just follow the rules, but I figure there's probably something for me to learn in reproducing this.