That quote from Andy @Krazy Glew is about synchronous exceptions discovered during execution of a "normal" instruction, like `mov eax, [rdi]` raising #PF if it turns out that RDI is pointing to an unmapped page.¹ You expect that not to fault, so you defer doing anything about it until retirement, in case it was in the shadow of a branch mispredict or an earlier exception.
But yes, his answer doesn't go into detail about how the pipeline optimizes for synchronous `int` trap instructions that we know upon decode will always cause an exception. Trap instructions are also pretty rare in the overall instruction mix, so optimizing for them doesn't save you a lot of power; it's only worth doing the things that are easy.
As Andy says, current CPUs don't rename the privilege level and thus can't speculate into an interrupt/exception handler, so stalling fetch/decode after seeing an `int` or `syscall` is definitely a sensible thing to do. I'm just going to write `int` or "trap instruction", but the same goes for `syscall`/`sysenter`/`sysret`/`iret` and other privilege-changing "branch" instructions. And for the 1-byte versions of `int` like `int3` (`0xcc`) and `int1` (`0xf1`). The conditional trap-on-overflow `into` is interesting; for non-horrible performance in the no-trap case it's probably assumed not to trap. (And of course there are `vmcall` and other VMX-extension instructions, probably SGX `EENTER`, and probably other stuff. But as far as stalling the pipeline is concerned, I'd guess all trap instructions are treated equally, except for the conditional `into`.)
I'd assume that, like `lfence`, the CPU doesn't speculate past a trap instruction. You're right, there'd be no point in having those uops in the pipeline, because anything after an `int` is definitely getting flushed.
IDK if anything would fetch from the IVT (real-mode interrupt vector table) or IDT (interrupt descriptor table) to get the address of an `int` handler before the `int` instruction becomes non-speculative in the back-end. Possibly. (Some trap instructions, like `syscall`, use an MSR to set the handler address, so starting code fetch from there would possibly be useful, especially if it triggers an L1i miss early. This has to be weighed against the possibility of seeing `int` and other trap instructions on the wrong path, after a branch miss.)
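To make the "handler address is in an MSR" point concrete, here's a minimal sketch (my illustration, assuming a bare-metal / kernel context; `syscall_entry` is a hypothetical entry stub, not a real kernel symbol): the 64-bit SYSCALL target lives in the IA32_LSTAR MSR, so the front-end could in principle know where `syscall` goes without reading memory, unlike the IDT lookup an `int` needs.

```c
/* Sketch only: kernel / bare-metal context assumed, not userspace code.
 * SYSCALL jumps to the address in the IA32_LSTAR MSR (0xC0000082),
 * which the kernel programs once at boot. */
#include <stdint.h>

#define MSR_LSTAR 0xC0000082u

static inline void wrmsr(uint32_t msr, uint64_t value)
{
    /* WRMSR takes the MSR index in ECX and the value in EDX:EAX */
    __asm__ volatile("wrmsr"
                     : /* no outputs */
                     : "c"(msr), "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
}

extern void syscall_entry(void);        /* hypothetical kernel entry stub */

void setup_syscall_target(void)
{
    wrmsr(MSR_LSTAR, (uint64_t)(uintptr_t)syscall_entry);  /* SYSCALL fetches from here */
}
```

An `int n`, by contrast, has to index the IDT in memory (a 16-byte gate descriptor per vector in long mode) before the handler address is even known.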
Mis-speculation hitting a trap instruction is probably rare enough that it would be worth it to start loading from the IDT or prefetching the `syscall` entry point as soon as the front-end sees a trap instruction, if the front-end were smart enough to handle all this. But it probably isn't. Leaving the fancy stuff to microcode makes sense to limit the complexity of the front-end. Traps are rare-ish, even in `syscall`-heavy workloads. Batching work to hand off in bigger chunks across the user/kernel barrier is a good thing, because a cheap `syscall` is very, very hard post-Spectre.
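As an aside on that batching point, here's a minimal userspace sketch (my example, not from the quote) of handing the kernel many buffers in one trap with `writev()`, instead of paying the syscall entry/exit cost once per buffer:

```c
/* Sketch: amortize the user/kernel transition over many buffers.
 * One writev() = one trap, instead of n separate write() traps. */
#include <sys/uio.h>
#include <unistd.h>

ssize_t write_batched(int fd, char *bufs[], size_t lens[], int n)
{
    struct iovec iov[64];           /* assume n <= 64 for this sketch (IOV_MAX is much larger) */
    if (n > 64)
        n = 64;
    for (int i = 0; i < n; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = lens[i];
    }
    return writev(fd, iov, n);      /* a single trap instruction for the whole batch */
}
```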
So at the latest, a trap would be detected at issue/rename (which already knows how to stall for (partially) serializing instructions), and no further uops would be allocated into the out-of-order back end until the `int` retired and the exception was actually being taken.
But detecting it in decode seems likely, and not decoding any further past an instruction that definitely takes an exception (and where we don't know where to fetch next). The decode stage already knows how to stall, e.g. for illegal-instruction traps.
> Let's say it is picked up at predecode
That's probably not practical; you don't know it's an `int` until full decode. Pre-decode is just instruction-length finding on Intel CPUs. I'd assume that the opcodes for `int` and `syscall` are just two of many that happen to have the same length.
Building in HW to look deeper, searching for trap instructions, would cost more power than it's worth in pre-decode. (Remember, traps are very rare, and detecting them early mostly only saves power, so you can't spend more power looking for them than you'd save by stopping pre-decode after passing along a trap to the decoders.)
You need to decode the `int` so its microcode can execute and get the CPU started again, running the interrupt handler; but yes, in theory you could have pre-decode stall in the cycle after passing it through.
The regular decoders are where, for example, jump instructions that branch prediction missed are identified, so it makes much more sense for the main decode stage to handle traps by simply not going any further.
**Hyperthreading**
You don't just power-gate the front-end when you discover a stall. You let the other logical thread have all the cycles.
Hyperthreading makes it less valuable for the front-end to start fetching from memory pointed to by the IDT without the back-end's help. If the other thread isn't stalled, and can benefit from the extra front-end bandwidth while this thread sorts out its trap, the CPU is doing useful work.
I certainly wouldn't rule out code-fetch from the SYSCALL entry-point, because that address is in an MSR, and it's one of the few traps that is performance-relevant in some workloads.
Another thing I'm curious about is how much impact, if any, one logical core switching privilege levels has on the performance of the other logical core. To test this, you'd construct a workload that bottlenecks on your choice of front-end issue bandwidth, a back-end port, back-end dep-chain latency, or the back-end's ability to find ILP over a medium to long distance (RS size or ROB size), or some combination of these. Then compare cycles/iteration for that test workload running on a core by itself, sharing a core with a tight `dec/jnz` thread, with a 4x `pause` + `dec/jnz` workload, and with a `syscall` workload that makes ENOSYS system calls under Linux. Maybe also an `int 0x80` workload to compare different traps.
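If anyone wants to try this, here's a rough sketch of the sibling-thread workloads (my code, untested; it assumes Linux, x86-64, and GCC/Clang inline asm, and that you separately pin the test thread and the sibling thread to the two logical cores of one physical core, e.g. with `pthread_setaffinity_np`, and count cycles/iteration of the test loop with `perf` or `rdtsc`):

```c
/* Sketch only: the four candidate sibling workloads described above.
 * Thread creation, CPU pinning, and timing of the *test* thread are omitted. */
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

void sibling_dec_jnz(void)              /* tight ALU + taken-branch loop */
{
    __asm__ volatile("mov $1000000000, %%ecx\n"
                     "1: dec %%ecx\n"
                     "   jnz 1b"
                     : : : "ecx", "cc");
}

void sibling_pause_dec_jnz(void)        /* 4x pause per dec/jnz iteration */
{
    __asm__ volatile("mov $1000000000, %%ecx\n"
                     "1: pause\n pause\n pause\n pause\n"
                     "   dec %%ecx\n"
                     "   jnz 1b"
                     : : : "ecx", "cc");
}

void sibling_enosys_syscall(void)       /* trap-heavy: invalid syscall number */
{
    for (long i = 0; i < 100000000; i++)
        syscall(-1L);                   /* returns -1 with errno = ENOSYS */
}

void sibling_int80(void)                /* legacy int 0x80 trap, also -ENOSYS */
{
    for (long i = 0; i < 100000000; i++) {
        long ret;                       /* needs CONFIG_IA32_EMULATION in the kernel */
        __asm__ volatile("int $0x80" : "=a"(ret) : "a"(-1L) : "memory");
        (void)ret;
    }
}
```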
Footnote 1: Exception handling, like #PF on a normal load.
(Off topic, re: innocent-looking instructions that fault, not trap instructions that can be detected in the decoders as raising exceptions.)
You wait until commit (retirement) because you don't want to start an expensive pipeline flush right away, only to discover that this instruction was in the shadow of a branch miss (or an earlier faulting instruction) and shouldn't have run (with that bad address) in the first place. Let the fast branch-recovery mechanism catch it.
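As a concrete (hypothetical) example of why you wait: in code like the sketch below, the load can execute speculatively with an out-of-bounds `idx` if the bounds-check branch mispredicts; deferring the #PF to retirement means that wrong-path load costs nothing extra, because fast branch recovery throws it away anyway.

```c
/* Sketch: the load may run speculatively with a bad index (and a bad address)
 * before the mispredicted bounds check resolves.  Only if the load reaches
 * retirement on the correct path does a #PF actually need to be taken. */
int lookup(const int *table, unsigned long len, unsigned long idx)
{
    if (idx < len)              /* the branch predictor may wrongly guess "taken" */
        return table[idx];      /* speculative load; possible #PF on the wrong path */
    return -1;
}
```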
This wait-until-retirement strategy (and a dangerous L1d cache that doesn't squash the load value to 0 for L1d hits where the TLB says the page is valid but not readable) is the key to why the Meltdown and L1TF exploits work on some Intel CPUs. (http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/). Understanding Meltdown is pretty helpful for understanding synchronous-exception handling strategies in high-performance CPUs: marking the instruction and only doing anything about it if it reaches retirement is a good cheap strategy, because exceptions are very rare.
It's apparently not worth the complexity to have execution units signal back to the front-end to stop fetch/decode/issue if any uop in the back end detects a pending #PF or other exception. (Presumably because that would more tightly couple parts of the CPU that are otherwise pretty far apart.)
And because instructions from the wrong path might still be in flight during fast recovery from a branch miss, making sure you only stop the front-end for expected faults on what we currently think is the correct path of execution would require more tracking. Any uop in the back end was at one point thought to be on the correct path, but it might not be anymore by the time it reaches the end of an execution unit.
If you weren't doing fast recovery, then maybe it would be worth having the back-end send a "something is wrong" signal to stall the front-end until the back-end either actually takes an exception, or discovers the correct path.
With SMT (hyperthreading), this could leave more front-end bandwidth for other threads when a thread detected that it was currently speculating down a (possibly correct) path that leads to a fault.
So there is maybe some merit to this idea; I wonder if any CPUs do it?