What is a retpoline and how does it work?

Question

In order to mitigate against kernel or cross-process memory disclosure (the Spectre attack), the Linux kernel¹ will be compiled with a new option, -mindirect-branch=thunk-extern introduced to gcc to perform indirect calls through a so-called retpoline.

This appears to be a newly invented term as a Google search turns up only very recent use (generally all in 2018).

What is a retpoline and how does it prevent the recent kernel information disclosure attacks?

¹ It's not Linux specific, however - similar or identical construct seems to be used as part of the mitigation strategies on other OSes.

oh, so it's pronounced /ˌtræmpəˈlin/ (American) or /ˈtræmpəˌliːn/ (British) — Walter Tross
You might mention that this is the Linux kernel, though gcc points that way! I did not recognise lkml.org/lkml/2018/1/3/780 as on the Linux Kernel Mailing List site, not even once I looked there (and was served a snapshot as it was offline). — PJTraill
@PJTraill - good point, I updated the question text. Note that I saw it first in the Linux kernel because of it's relatively open development process, but no doubt the same or similar techniques are being uses as mitigations across the spectrum of open and closed source OSes. So I don't see this as Linux-specific, but the link certainly is. — BeeOnRope

Tobias Ribizel Tobias Ribizel · Accepted Answer · 2018-01-04T16:25:38

The article mentioned by sgbj in the comments written by Google's Paul Turner explains the following in much more detail, but I'll give it a shot:

As far as I can piece this together from the limited information at the moment, a retpoline is a return trampoline that uses an infinite loop that is never executed to prevent the CPU from speculating on the target of an indirect jump.

The basic approach can be seen in Andi Kleen's kernel branch addressing this issue:

It introduces the new __x86.indirect_thunk call that loads the call target whose memory address (which I'll call ADDR) is stored on top of the stack and executes the jump using a the RET instruction. The thunk itself is then called using the NOSPEC_JMP/CALL macro, which was used to replace many (if not all) indirect calls and jumps. The macro simply places the call target on the stack and sets the return address correctly, if necessary (note the non-linear control flow):

.macro NOSPEC_CALL target
    jmp     1221f            /* jumps to the end of the macro */
1222:
    push    \target          /* pushes ADDR to the stack */
    jmp __x86.indirect_thunk /* executes the indirect jump */
1221:
    call    1222b            /* pushes the return address to the stack */
.endm

The placement of call in the end is necessary so that when the indirect call is finished, the control flow continues behind the use of the NOSPEC_CALL macro, so it can be used in place of a regular call

The thunk itself looks as follows:

    call retpoline_call_target
2:
    lfence /* stop speculation */
    jmp 2b
retpoline_call_target:
    lea 8(%rsp), %rsp 
    ret

The control flow can get a bit confusing here, so let me clarify:

call pushes the current instruction pointer (label 2) to the stack.
lea adds 8 to the stack pointer, effectively discarding the most recently pushed quadword, which is the last return address (to label 2). After this, the top of the stack points at the real return address ADDR again.
ret jumps to *ADDR and resets the stack pointer to the beginning of the call stack.

In the end, this whole behaviour is practically equivalent to jumping directly to *ADDR. The one benefit we get is that the branch predictor used for return statements (Return Stack Buffer, RSB), when executing the call instruction, assumes that the corresponding ret statement will jump to the label 2.

The part after the label 2 actually never gets executed, it's simply an infinite loop that would in theory fill the instruction pipeline with JMP instructions. By using LFENCE,PAUSE or more generally an instruction causing the instruction pipeline to be stall stops the CPU from wasting any power and time on this speculative execution. This is because in case the call to retpoline_call_target would return normally, the LFENCE would be the next instruction to be executed. This is also what the branch predictor will predict based on the original return address (the label 2)

To quote from Intel's architecture manual:

Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute until the LFENCE completes.

Note however that the specification never mentions that LFENCE and PAUSE cause the pipeline to stall, so I'm reading a bit between the lines here.

Now back to your original question: The kernel memory information disclosure is possible because of the combination of two ideas:

Even though speculative execution should be side-effect free when the speculation was wrong, speculative execution still affects the cache hierarchy. This means that when a memory load is executed speculatively, it may still have caused a cache line to be evicted. This change in the cache hierarchy can be identified by carefully measuring the access time to memory that is mapped onto the same cache set.
You can even leak some bits of arbitrary memory when the source address of the memory read was itself read from kernel memory.
The indirect branch predictor of Intel CPUs only uses the lowermost 12 bits of the source instruction, thus it is easy to poison all 2^12 possible prediction histories with user-controlled memory addresses. These can then, when the indirect jump is predicted within the kernel, be speculatively executed with kernel privileges. Using the cache-timing side-channel, you can thus leak arbitrary kernel memory.

UPDATE: On the kernel mailing list, there is an ongoing discussion that leads me to believe retpolines don't fully mitigate the branch prediction issues, as when the Return Stack Buffer (RSB) runs empty, more recent Intel architectures (Skylake+) fall back to the vulnerable Branch Target Buffer (BTB):

Retpoline as a mitigation strategy swaps indirect branches for returns, to avoid using predictions which come from the BTB, as they can be poisoned by an attacker. The problem with Skylake+ is that an RSB underflow falls back to using a BTB prediction, which allows the attacker to take control of speculation.

What is a retpoline and how does it work?

3 Answers