How does the Linux kernel enter supervisor mode in x86?

Question

I am trying to probe the event where the mode switch (user -> kernel mode) happens, so I need to find which function is triggered when the transition happens.

It seems that SBI is the place that does the transition for RISC-V. Where is the code that handles this for x86?

On x86, normally user-space uses syscall. The 64-bit-kernel entry points are defined in arch/x86/entry/entry_64.S and arch/x86/entry/entry_64_compat.S. See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some info on how those system-call entry points work. (At least before a recent re-engineering to dispatch via C.) — Peter Cordes
There is no single way. Accessing memory could trigger kernel mode (usually to update the page access count, for cache management). Any error can also trigger kernel mode. So can debugging instructions or traps, long jumps (depending on the descriptor), interrupts (hardware, e.g. the timer for the scheduler, or from devices; software interrupts), system calls... — Giacomo Catenazzi

Marco Bonelli · Accepted Answer · 2021-03-05T13:57:00

It's not that simple. In x86, there are 4 different privilege levels: 0 (operating system kernel), 1, 2, and 3 (applications). Privilege levels 1 and 2 aren't used in Linux: the kernel runs at privilege level 0 while user space code runs at privilege level 3. The current privilege level (CPL) is stored in bits 0 and 1 of the CS (code segment) register.
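As a rough illustration of that last sentence, here is a minimal Python sketch that splits a segment selector into its fields. The function name is made up for this example, and the two selector values are the ones x86-64 Linux uses for __KERNEL_CS and __USER_CS. Reading the live CS register would need assembly, which this sketch does not attempt.

```python
def decode_selector(selector: int) -> dict:
    """Split a 16-bit x86 segment selector into its three fields."""
    return {
        "index": selector >> 3,                          # bits 3-15: descriptor table index
        "table": "LDT" if selector & 0b100 else "GDT",   # bit 2: table indicator
        "privilege": selector & 0b11,                    # bits 0-1: privilege level
    }


KERNEL_CS = 0x10  # __KERNEL_CS on x86-64 Linux
USER_CS = 0x33    # __USER_CS on x86-64 Linux

for name, selector in (("kernel", KERNEL_CS), ("user", USER_CS)):
    print(name, hex(selector), decode_selector(selector))
```

The privilege field comes out as 0 for the kernel selector and 3 for the user selector, which matches the two levels Linux actually uses.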

There are multiple ways in which the transition from user to kernel can happen:

Through hardware interrupts and processor exceptions: page faults, general protection faults, device interrupts, the hardware timer, and so on.
Through software interrupts: the int instruction raises a software interrupt. The most common in Linux is int 0x80, which is configured to be used for system calls from user space to kernel space.
Through specialized instructions like sysenter and syscall.

In any case, there is no actual code that does the transition: it is done by the processor itself, which switches from one privilege level to the other, and sets up segment selectors, instruction pointer, stack pointer and more according to the information that was set up by the kernel right after booting.

In the case of interrupts, the entries of the Interrupt Descriptor Table (IDT) are used. See this useful documentation page about interrupts in Linux which explains more about the IDT. If you want to get into the details, check out Chapter 5 of the Intel 64 and IA-32 architectures software developer's manual, Volume 3.

In short, each IDT entry specifies a descriptor privilege level (DPL) and a new code segment and offset. For software interrupts, the processor makes some privilege level checks (one of which is CPL <= DPL) to determine whether the code that issued the interrupt is allowed to do so. The processor then loads CS with the code segment selector from the IDT entry, whose privilege level bits are 0, and starts executing the interrupt handler. This is how the canonical int 0x80 syscall for 32-bit x86 is made.
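To make the IDT entry layout and the CPL <= DPL check more concrete, here is a small Python model of a 32-bit gate descriptor. It is only a sketch: the class and function names are invented for this example, the handler address is a placeholder, and the selector and gate type are illustrative values rather than something read from a running kernel. The processor performs more checks than the single one modelled here.

```python
from dataclasses import dataclass


@dataclass
class Gate:
    offset: int      # handler address
    selector: int    # code segment selector loaded into CS
    gate_type: int   # 0xE = 32-bit interrupt gate, 0xF = 32-bit trap gate
    dpl: int         # descriptor privilege level
    present: bool


def pack_gate32(offset, selector, gate_type, dpl, present=True) -> int:
    """Pack the fields into a 64-bit value laid out like a 32-bit IDT gate."""
    raw = offset & 0xFFFF                      # bits 0-15: offset, low half
    raw |= (selector & 0xFFFF) << 16           # bits 16-31: segment selector
    raw |= (gate_type & 0xF) << 40             # bits 40-43: gate type
    raw |= (dpl & 0x3) << 45                   # bits 45-46: DPL
    raw |= (1 if present else 0) << 47         # bit 47: present flag
    raw |= ((offset >> 16) & 0xFFFF) << 48     # bits 48-63: offset, high half
    return raw


def unpack_gate32(raw: int) -> Gate:
    """Inverse of pack_gate32."""
    return Gate(
        offset=(raw & 0xFFFF) | (((raw >> 48) & 0xFFFF) << 16),
        selector=(raw >> 16) & 0xFFFF,
        gate_type=(raw >> 40) & 0xF,
        dpl=(raw >> 45) & 0x3,
        present=bool((raw >> 47) & 1),
    )


def software_int_allowed(cpl: int, gate: Gate) -> bool:
    """Model only the CPL <= DPL check made for the int instruction."""
    return cpl <= gate.dpl


# Hypothetical gate for vector 0x80: DPL 3 so that user code may invoke it.
syscall_gate = unpack_gate32(pack_gate32(0xC1234560, 0x60, 0xE, dpl=3))
# Hypothetical gate for some other vector: DPL 0, so int from user space faults.
kernel_only_gate = unpack_gate32(pack_gate32(0xC1234560, 0x60, 0xE, dpl=0))

print(software_int_allowed(3, syscall_gate), software_int_allowed(3, kernel_only_gate))
print("privilege bits of new CS:", syscall_gate.selector & 0b11)
```

The gate for the syscall vector carries DPL 3, which is why user code at CPL 3 passes the check, while the selector stored in the gate has its low two bits clear, so the handler runs at privilege level 0.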

In the case of specialized instructions like sysenter and syscall, the details differ, but the concept is similar: the CPU checks privileges and then retrieves the information from dedicated Model Specific Registers (MSRs) that were previously set up by the kernel after boot.
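For the 64-bit syscall instruction specifically, the register updates can be sketched in Python as below. This is a simplified model, not kernel code: the CpuState class and syscall_64 function are invented for illustration, the LSTAR value is a placeholder address, and the FMASK value masks only the interrupt flag to keep the example short. The MSR numbers are the architectural ones for IA32_STAR, IA32_LSTAR and IA32_FMASK.

```python
from dataclasses import dataclass, replace

IA32_STAR = 0xC0000081   # bits 32-47 hold the kernel CS selector base
IA32_LSTAR = 0xC0000082  # 64-bit syscall entry point address
IA32_FMASK = 0xC0000084  # RFLAGS bits to clear on entry

MASK64 = 0xFFFFFFFFFFFFFFFF


@dataclass(frozen=True)
class CpuState:
    rip: int
    rflags: int
    cs: int
    ss: int
    rcx: int = 0
    r11: int = 0


def syscall_64(cpu: CpuState, msrs: dict, next_rip: int) -> CpuState:
    """Model the register changes made by syscall in 64-bit mode."""
    star_selector = (msrs[IA32_STAR] >> 32) & 0xFFFF
    return replace(
        cpu,
        rcx=next_rip,                                      # return address saved in RCX
        r11=cpu.rflags,                                    # old flags saved in R11
        rip=msrs[IA32_LSTAR],                              # jump to the kernel entry point
        rflags=cpu.rflags & ~msrs[IA32_FMASK] & MASK64,    # clear the masked flags
        cs=star_selector & 0xFFFC,                         # privilege bits forced to 0
        ss=(star_selector + 8) & 0xFFFF,
    )
    # Note: syscall itself does not change RSP; the entry code switches stacks.


# Illustrative MSR contents (assumptions, not read from a real machine).
msrs = {IA32_STAR: 0x10 << 32, IA32_LSTAR: 0xFFFFFFFF81E00000, IA32_FMASK: 0x200}
user = CpuState(rip=0x401000, rflags=0x246, cs=0x33, ss=0x2B)

kernel = syscall_64(user, msrs, next_rip=user.rip + 2)  # syscall is a 2-byte opcode
print(hex(kernel.rip), hex(kernel.cs), "privilege:", kernel.cs & 0b11)
```

Nothing in user space chooses the destination here: the new instruction pointer and code segment come entirely from the MSRs that the kernel programmed earlier.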

For system calls the result is always the same: user code switches to privilege level 0 and starts executing kernel code, ending up right at the beginning of one of the different syscall entry points defined by the kernel.

Possible syscall entry points are:

entry_INT80_32 for 32-bit int 0x80
entry_INT80_compat for 32-bit int 0x80 on a 64-bit kernel
entry_SYSENTER_32 for 32-bit sysenter
entry_SYSENTER_compat for 32-bit sysenter on a 64-bit kernel
entry_SYSCALL_64 for 64-bit syscall
entry_SYSCALL_compat for 32-bit syscall on a 64-bit kernel (a special entry point that is not used by user code; in theory syscall is also a valid 32-bit instruction on AMD CPUs, but Linux only uses it for 64-bit because of its weird semantics)
