FreeBSD has some great notes on this topic and makes a good case study. You can read more here:
http://www.freebsd.org/doc/en/books/arch-handbook/smp-design.html
On the topic of locking, they are careful to use finer-grained locking, so that preemption will most likely occur only while no lock is held.
In cases where preempting a lock-holding thread would cause correctness issues, they provide an API for indicating that the thread is temporarily in a non-preemptible region:
While locks can protect most data in the case of a preemption, not all
of the kernel is preemption safe. For example, if a thread holding a
spin mutex is preempted and the new thread attempts to grab the same spin
mutex, the new thread may spin forever as the interrupted thread may
never get a chance to execute. Also, some code such as the code to
assign an address space number for a process during exec on the Alpha
needs to not be preempted as it supports the actual context switch
code. Preemption is disabled for these code sections by using a
critical section.
And:
The responsibility of the critical section API is to prevent context
switches inside of a critical section. With a fully preemptive kernel,
every setrunqueue of a thread other than the current thread is a
preemption point. One implementation is for critical_enter to set a
per-thread flag that is cleared by its counterpart. If setrunqueue is
called with this flag set, it does not preempt regardless of the
priority of the new thread relative to the current thread. However,
since critical sections are used in spin mutexes to prevent context
switches and multiple spin mutexes can be acquired, the critical
section API must support nesting. For this reason the current
implementation uses a nesting count instead of a single per-thread
flag.
So in your example, if you have a preemptible thread that holds locks important to the scheduler, you might mark that thread as temporarily non-preemptible.
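Here's a minimal, user-space sketch (in C) of the nesting-count idea the quote describes. critical_enter()/critical_exit() and the td_critnest field are modeled on FreeBSD's names, but the rest (the maybe_preempt helper, the printf "scheduler") is purely illustrative and not the actual kernel code:

    #include <assert.h>
    #include <stdio.h>

    struct thread {
        int td_critnest;            /* nesting count; 0 => preemption allowed */
    };

    static struct thread cur = { 0 };   /* stand-in for the kernel's curthread */

    static void critical_enter(void)
    {
        cur.td_critnest++;          /* enter a (possibly nested) critical section */
    }

    static void critical_exit(void)
    {
        assert(cur.td_critnest > 0);
        cur.td_critnest--;          /* leave the innermost critical section */
        /* In the kernel, a preemption deferred while the count was non-zero
         * would be performed here once it drops back to zero. */
    }

    /* Called whenever another thread is made runnable (setrunqueue). */
    static void maybe_preempt(void)
    {
        if (cur.td_critnest > 0) {
            printf("in a critical section (depth %d): preemption deferred\n",
                   cur.td_critnest);
            return;
        }
        printf("no critical section held: would switch to the new thread\n");
    }

    int main(void)
    {
        critical_enter();           /* e.g. acquiring a spin mutex */
        critical_enter();           /* nested: acquiring a second spin mutex */
        maybe_preempt();            /* deferred, count == 2 */
        critical_exit();
        maybe_preempt();            /* still deferred, count == 1 */
        critical_exit();
        maybe_preempt();            /* now preemption would be allowed */
        return 0;
    }

The key point from the quote is that nesting is just a counter: acquiring a second spin mutex bumps the count rather than setting an already-set flag, and preemption only becomes possible again once every critical_exit has run.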
You can find parallels for this approach even in application-level software. .Net has a concept called Constrained Execution Regions [1] [2] [3]. While CERs have nothing to do with scheduling, they are used to signal to the VM that the block of code about to execute must run 'atomically': the VM should postpone any Thread.Abort() it might otherwise perform, and it ensures up front that the code can complete (methods are already JIT'd and there's enough stack space). Different purpose, but a similar rough idea - tell the scheduling overlords "if you interrupt me in a weird way, you may break correctness".
Of course, whether it's kernel preemption or .Net CERs, it is up to the developer to correctly identify every critical execution region so that the relevant lock invariants are actually enforced.
FreeBSD also has facilities to aid in debugging this sort of problem and to help identify, e.g., deadlocks. One particular technique is lock ordering: each lock is assigned a priority, and as you grab locks you record the highest priority you currently hold. If you ever attempt to grab a lock whose priority is lower than that, you know you've violated the lock ordering and should report it, e.g. by recording an operating system fault.
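As a rough illustration (a sketch of the idea only - FreeBSD's actual checker for this is the WITNESS facility, which is far more elaborate and also tracks lock releases), you could keep a per-thread "highest lock level held" and check it on every acquisition. The names ordered_lock, ordered_lock_acquire, and highest_level_held are invented for this example, and release tracking is omitted for brevity:

    #include <assert.h>
    #include <stdio.h>

    struct ordered_lock {
        int level;                       /* position in the global lock order */
        const char *name;
        /* ... the underlying mutex would live here ... */
    };

    static _Thread_local int highest_level_held;   /* 0 = no locks held */

    static void ordered_lock_acquire(struct ordered_lock *l)
    {
        if (l->level <= highest_level_held) {
            /* In a kernel this would be reported as a fault/panic. */
            fprintf(stderr, "lock order violation: %s (level %d) after level %d\n",
                    l->name, l->level, highest_level_held);
            assert(0);
        }
        highest_level_held = l->level;
        /* ... actually acquire the underlying mutex ... */
    }

    int main(void)
    {
        struct ordered_lock lock1 = { 1, "lock1" };
        struct ordered_lock lock2 = { 2, "lock2" };

        ordered_lock_acquire(&lock1);    /* fine: 1 > 0 */
        ordered_lock_acquire(&lock2);    /* fine: 2 > 1 */
        /* ordered_lock_acquire(&lock1);    would trip the check: 1 <= 2 */
        return 0;
    }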
The need for lock ordering may not be immediately apparent, but consider the classic deadlock example - two resources protected by two locks:
Thread A wants to grab locks 1 and 2.
Thread B wants to grab locks 1 and 2, but:
- Thread A grabs lock 1
- Thread B grabs lock 2
- Thread A tries to grab lock 2, but can't
- Thread B tries to grab lock 1, but can't.
- Deadlock is achieved.
The threads did not grab the locks in a consistent order. Thread B should notice that it is holding lock 2 while trying to grab lock 1; since 1 < 2, it is grabbing locks out of order and should abort. Voilà, the deadlock is avoided, or at least detected and reported so it can be fixed.
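In code, the fix is simply to pick one global order and have every thread follow it. A small pthreads sketch (the lock and thread names are made up for the example):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock1 = PTHREAD_MUTEX_INITIALIZER;  /* order: 1 */
    static pthread_mutex_t lock2 = PTHREAD_MUTEX_INITIALIZER;  /* order: 2 */

    static void *worker(void *arg)
    {
        /* Every thread acquires lock1 first, then lock2: whichever thread
         * wins lock1 can always go on to take lock2, so neither thread can
         * end up waiting on a lock the other already holds. */
        pthread_mutex_lock(&lock1);
        pthread_mutex_lock(&lock2);
        printf("thread %s holds both locks\n", (const char *)arg);
        pthread_mutex_unlock(&lock2);
        pthread_mutex_unlock(&lock1);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, "A");
        pthread_create(&b, NULL, worker, "B");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }

Because both threads take lock1 before lock2, the problematic interleaving above simply cannot occur.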