That quote from Andy @Krazy Glew is about synchronous exceptions discovered during execution of a "normal" instruction, like `mov eax, [rdi]` raising #PF if it turns out that RDI is pointing to an unmapped page.¹ You expect that not to fault, so you defer doing anything about it until retirement, in case it was in the shadow of a branch mispredict or an earlier exception.
But yes, his answer doesn't go into detail about how the pipeline optimizes for synchronous `int` trap instructions that we know upon decode will always cause an exception. Trap instructions are also pretty rare in the overall instruction mix, so optimizing for them doesn't save you a lot of power; it's only worth doing the things that are easy.
As Andy says, current CPUs don't rename the privilege level and thus can't speculate into an interrupt/exception handler, so stalling fetch/decode after seeing an `int` or `syscall` is definitely a sensible thing to do. I'm just going to write `int` or "trap instruction", but the same goes for `syscall`/`sysenter`/`sysret`/`iret` and other privilege-changing "branch" instructions. And for the 1-byte versions of `int` like `int3` (`0xcc`) and `int1` (`0xf1`). The conditional trap-on-overflow `into` is interesting; for non-horrible performance in the no-trap case it's probably assumed not to trap. (And of course there are `vmcall` and other VMX-extension instructions, probably SGX `EENTER`, and probably other stuff. But as far as stalling the pipeline is concerned, I'd guess all trap instructions are treated equally, except for the conditional `into`.)
I'd assume that, like `lfence`, the CPU doesn't speculate past a trap instruction. You're right, there'd be no point in having those uops in the pipeline, because anything after an `int` is definitely getting flushed.
IDK if anything would fetch from the IVT (real-mode interrupt vector table) or IDT (interrupt descriptor table) to get the address of an `int` handler before the `int` instruction becomes non-speculative in the back-end. Possibly. (Some trap instructions, like `syscall`, use an MSR to set the handler address, so starting code fetch from there would possibly be useful, especially if it triggers an L1i miss early. This has to be weighed against the possibility of seeing `int` and other trap instructions on the wrong path, after a branch miss.)
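To make the "handler address is in an MSR" point concrete, here's a minimal sketch (my illustration, assuming a bare-metal / kernel context; `syscall_entry` is a hypothetical entry stub, not a real kernel symbol): the 64-bit SYSCALL target lives in the IA32_LSTAR MSR, so the front-end could in principle know where `syscall` goes without reading memory, unlike the IDT lookup an `int` needs.

```c
/* Sketch only: kernel / bare-metal context assumed, not userspace code.
 * SYSCALL jumps to the address in the IA32_LSTAR MSR (0xC0000082),
 * which the kernel programs once at boot. */
#include <stdint.h>

#define MSR_LSTAR 0xC0000082u

static inline void wrmsr(uint32_t msr, uint64_t value)
{
    /* WRMSR takes the MSR index in ECX and the value in EDX:EAX */
    __asm__ volatile("wrmsr"
                     : /* no outputs */
                     : "c"(msr), "a"((uint32_t)value), "d"((uint32_t)(value >> 32)));
}

extern void syscall_entry(void);        /* hypothetical kernel entry stub */

void setup_syscall_target(void)
{
    wrmsr(MSR_LSTAR, (uint64_t)(uintptr_t)syscall_entry);  /* SYSCALL fetches from here */
}
```

An `int n`, by contrast, has to index the IDT in memory (a 16-byte gate descriptor per vector in long mode) before the handler address is even known.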
Mis-speculation hitting a trap instruction is probably rare enough that it would be worth it to start loading from the IDT or prefetching the `syscall` entry point as soon as the front-end sees a trap instruction, if the front-end were smart enough to handle all this. But it probably isn't. Leaving the fancy stuff to microcode makes sense to limit the complexity of the front-end. Traps are rare-ish, even in `syscall`-heavy workloads. Batching work to hand off in bigger chunks across the user/kernel barrier is a good thing, because a cheap `syscall` is very, very hard post-Spectre.
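As an aside on that batching point, here's a minimal userspace sketch (my example, not from the quote) of handing the kernel many buffers in one trap with `writev()`, instead of paying the syscall entry/exit cost once per buffer:

```c
/* Sketch: amortize the user/kernel transition over many buffers.
 * One writev() = one trap, instead of n separate write() traps. */
#include <sys/uio.h>
#include <unistd.h>

ssize_t write_batched(int fd, char *bufs[], size_t lens[], int n)
{
    struct iovec iov[64];           /* assume n <= 64 for this sketch (IOV_MAX is much larger) */
    if (n > 64)
        n = 64;
    for (int i = 0; i < n; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = lens[i];
    }
    return writev(fd, iov, n);      /* a single trap instruction for the whole batch */
}
```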
So at the latest, a trap would be detected at issue/rename (which already knows how to stall for (partially) serializing instructions), and no further uops would be allocated into the out-of-order back end until the `int` retired and the exception was actually being taken.
But detecting it in decode seems likely, and not decoding any further past an instruction that definitely takes an exception (and where we don't know where to fetch next). The decode stage already knows how to stall, e.g. for illegal-instruction traps.
> Let's say it is picked up at predecode
That's probably not practical; you don't know it's an `int` until full decode. Pre-decode is just instruction-length finding on Intel CPUs. I'd assume that the opcodes for `int` and `syscall` are just two of many that happen to have the same length.
Building in HW to look deeper, searching for trap instructions, would cost more power than it's worth in pre-decode. (Remember, traps are very rare, and detecting them early mostly only saves power, so you can't spend more power looking for them than you'd save by stopping pre-decode after passing along a trap to the decoders.)
You need to decode the `int` so its microcode can execute and get the CPU started again, running the interrupt handler; but yes, in theory you could have pre-decode stall in the cycle after passing it through.
The regular decoders are where, for example, jump instructions that branch prediction missed are identified, so it makes much more sense for the main decode stage to handle traps by simply not going any further.
**Hyperthreading**
You don't just power-gate the front-end when you discover a stall. You let the other logical thread have all the cycles.
Hyperthreading makes it less valuable for the front-end to start fetching from memory pointed to by the IDT without the back-end's help. If the other thread isn't stalled, and can benefit from the extra front-end bandwidth while this thread sorts out its trap, the CPU is doing useful work.
I certainly wouldn't rule out code-fetch from the SYSCALL entry-point, because that address is in an MSR, and it's one of the few traps that is performance-relevant in some workloads.
Another thing I'm curious about is how much impact, if any, one logical core switching privilege levels has on the performance of the other logical core. To test this, you'd construct a workload that bottlenecks on your choice of front-end issue bandwidth, a back-end port, back-end dep-chain latency, or the back-end's ability to find ILP over a medium to long distance (RS size or ROB size), or some combination of these. Then compare cycles/iteration for that test workload running on a core by itself, sharing a core with a tight `dec/jnz` thread, with a 4x `pause` + `dec/jnz` workload, and with a `syscall` workload that makes ENOSYS system calls under Linux. Maybe also an `int 0x80` workload to compare different traps.
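If anyone wants to try this, here's a rough sketch of the sibling-thread workloads (my code, untested; it assumes Linux, x86-64, and GCC/Clang inline asm, and that you separately pin the test thread and the sibling thread to the two logical cores of one physical core, e.g. with `pthread_setaffinity_np`, and count cycles/iteration of the test loop with `perf` or `rdtsc`):

```c
/* Sketch only: the four candidate sibling workloads described above.
 * Thread creation, CPU pinning, and timing of the *test* thread are omitted. */
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

void sibling_dec_jnz(void)              /* tight ALU + taken-branch loop */
{
    __asm__ volatile("mov $1000000000, %%ecx\n"
                     "1: dec %%ecx\n"
                     "   jnz 1b"
                     : : : "ecx", "cc");
}

void sibling_pause_dec_jnz(void)        /* 4x pause per dec/jnz iteration */
{
    __asm__ volatile("mov $1000000000, %%ecx\n"
                     "1: pause\n pause\n pause\n pause\n"
                     "   dec %%ecx\n"
                     "   jnz 1b"
                     : : : "ecx", "cc");
}

void sibling_enosys_syscall(void)       /* trap-heavy: invalid syscall number */
{
    for (long i = 0; i < 100000000; i++)
        syscall(-1L);                   /* returns -1 with errno = ENOSYS */
}

void sibling_int80(void)                /* legacy int 0x80 trap, also -ENOSYS */
{
    for (long i = 0; i < 100000000; i++) {
        long ret;                       /* needs CONFIG_IA32_EMULATION in the kernel */
        __asm__ volatile("int $0x80" : "=a"(ret) : "a"(-1L) : "memory");
        (void)ret;
    }
}
```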
Footnote 1: Exception handling, like #PF on a normal load.
(Off topic, re: innocent-looking instructions that fault, not trap instructions that can be detected in the decoders as raising exceptions.)
You wait until commit (retirement) because you don't want to start an expensive pipeline flush right away, only to discover that this instruction was in the shadow of a branch miss (or an earlier faulting instruction) and shouldn't have run (with that bad address) in the first place. Let the fast branch-recovery mechanism catch it.
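As a concrete (hypothetical) example of why you wait: in code like the sketch below, the load can execute speculatively with an out-of-bounds `idx` if the bounds-check branch mispredicts; deferring the #PF to retirement means that wrong-path load costs nothing extra, because fast branch recovery throws it away anyway.

```c
/* Sketch: the load may run speculatively with a bad index (and a bad address)
 * before the mispredicted bounds check resolves.  Only if the load reaches
 * retirement on the correct path does a #PF actually need to be taken. */
int lookup(const int *table, unsigned long len, unsigned long idx)
{
    if (idx < len)              /* the branch predictor may wrongly guess "taken" */
        return table[idx];      /* speculative load; possible #PF on the wrong path */
    return -1;
}
```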
This wait-until-retirement strategy (and a dangerous L1d cache that doesn't squash the load value to 0 for L1d hits where the TLB says the page is valid but not readable) is the key to why the Meltdown and L1TF exploits work on some Intel CPUs. (http://blog.stuffedcow.net/2018/05/meltdown-microarchitecture/). Understanding Meltdown is pretty helpful for understanding synchronous-exception handling strategies in high-performance CPUs: marking the instruction and only doing anything about it if it reaches retirement is a good cheap strategy, because exceptions are very rare.
It's apparently not worth the complexity to have execution units signal back to the front-end to stop fetch/decode/issue if any uop in the back end detects a pending #PF or other exception. (Presumably because that would more tightly couple parts of the CPU that are otherwise pretty far apart.)
And because instructions from the wrong path might still be in flight during fast recovery from a branch miss, making sure you only stop the front-end for expected faults on what we currently think is the correct path of execution would require more tracking. Any uop in the back end was at one point thought to be on the correct path, but it might not be anymore by the time it reaches the end of an execution unit.
If you weren't doing fast recovery, then maybe it would be worth having the back-end send a "something is wrong" signal to stall the front-end until the back-end either actually takes an exception, or discovers the correct path.
With SMT (hyperthreading), this could leave more front-end bandwidth for other threads when a thread detected that it was currently speculating down a (possibly correct) path that leads to a fault.
So there is maybe some merit to this idea; I wonder if any CPUs do it?