3
votes

I have been trying to understand how context switching works in the Linux kernel. It appears to me that there is a situation (explained below) in which no IRET instruction is executed after an interrupt (I am sure there is something I am missing!). I am assuming that executing IRET after an interrupt is essential, since you can't receive the same interrupt again until IRET has been executed. I am only concerned with a uniprocessor kernel running on the x86 architecture.

The situation that I think might result in the described behavior is as follows:

  • Process A running in kernel mode calls schedule() voluntarily (for example while trying to acquire an already locked mutex).

  • schedule() decides to perform a context switch to process B and hence calls context_switch()

  • context_switch() switches virtual memory from A to B by calling switch_mm()

  • context_switch() runs the macro switch_to() to switch stacks and actually change the running process from A to B. Note that process A is now stuck inside switch_to(), and the stack of process A looks like this (stack growing downwards):


 ...
 [mutex_lock()]
 [schedule()]
 [context_switch()] (Stack Top)

  • Process B starts running. At some later time, it receives a timer interrupt and the timer interrupt handler decides that process B needs a reschedule.

  • On return from timer interrupt (but before invoking IRET) preempt_schedule_irq() is invoked.

  • preempt_schedule_irq() calls schedule().

  • schedule() decides to context switch to process A and calls context_switch().

  • context_switch() calls switch_mm() to switch the virtual memory.

  • context_switch() calls switch_to() to switch stacks. At this point, the stack of process B looks like the following:


...
[IRET return frame]
[ret_from_interrupt()]
[preempt_schedule_irq()]
[schedule()]
[context_switch()] (Stack top)

Now process A is running with its stack resumed. Since context_switch() in A was not invoked because of a timer interrupt, process A does not execute IRET and simply continues executing mutex_lock(). This scenario might block the timer interrupt forever.

What am I missing here?

2
It has been ages since I last looked at this, though it is a favourite subject of mine. Have you considered that the kernel's timer handler, as part of the scheduler, would issue the IRET? It might be easier to download Linux kernel v.99b (IIRC the smallest source download) and read that, without being overwhelmed by the size of the current source. - t0mm13b
Well, I understand that in most cases the kernel's timer handler would issue the IRET. But the problem is that the scheduler is not always invoked by the timer. About Linux kernel v.99b, that sounds like a good place to look! Thanks :) - harshad shirwadkar
I'm not sure I understand your question, but it doesn't appear that any iret would be lost. Since mutex_lock() is a syscall, whenever it is invoked it needs to return to user space using iret (at least that is true when the software interrupt int 0x80 is used to invoke syscalls). So when process A is resumed, it will simply finish executing mutex_lock() and iret. - zack

2 Answers

1
votes

Time to be economical with the truth. A non-Linux-specific explanation/example:

Thread A does not have to call IRET itself: the kernel code executes IRET to return execution to thread A. After all, a hardware interrupt from some peripheral device is one way A may have lost execution in the first place.

Typically, when thread A lost execution earlier on due to some other hardware interrupt or syscall, thread A's stack pointer is saved in the kernel TCB, pointing to an IRET return frame on A's stack, before switching to the kernel stack for all the internal scheduler etc. gubbins. If an exact IRET frame does not exist because of the particular syscall mechanism used, one is assembled. When the kernel needs to resume A, it reloads the hardware SP with thread A's stored SP and IRETs to user space. Job done: A resumes running with interrupts etc. enabled.

The kernel has then lost control. When it is entered again by the next hardware interrupt/driver or syscall, it can set its internal SP to the top of its own private stack, since it keeps no state data on it between invocations.
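The save/resume path described above can be sketched as a toy model in C. This is only an illustration under stated assumptions: `struct tcb`, `save_on_interrupt()` and `resume()` are hypothetical names, the stack is a plain array rather than real memory, and the frame layout assumed is the 32-bit x86 one for an interrupt that arrives in user mode (EIP at the lowest address, then CS, EFLAGS, ESP, SS).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EFLAGS_IF   0x200u   /* interrupt-enable flag, bit 9 of EFLAGS */
#define STACK_WORDS 64

/* 32-bit x86 interrupt frame for a user-to-kernel transition,
 * lowest address first (EIP is pushed last). */
struct iret_frame {
    uint32_t eip, cs, eflags, esp, ss;
};

/* Hypothetical thread control block: a simulated stack plus the saved SP. */
struct tcb {
    uint32_t  stack[STACK_WORDS];
    uint32_t *saved_sp;      /* points at the IRET frame while descheduled */
};

/* Simulated interrupt entry: push the frame and remember where it lives. */
static void save_on_interrupt(struct tcb *t, const struct iret_frame *cpu_state)
{
    uint32_t *sp = t->stack + STACK_WORDS;            /* stack grows downwards */
    sp -= sizeof(*cpu_state) / sizeof(uint32_t);
    memcpy(sp, cpu_state, sizeof(*cpu_state));
    t->saved_sp = sp;
}

/* Simulated resume: reload SP from the TCB and "IRET" by popping the frame.
 * The returned frame stands in for the state the CPU would reload. */
static struct iret_frame resume(struct tcb *t)
{
    struct iret_frame f;
    assert(t->saved_sp != NULL);                      /* must have been saved */
    memcpy(&f, t->saved_sp, sizeof(f));
    t->saved_sp = NULL;                               /* frame consumed */
    return f;
}
```

In this model, the EFLAGS value restored by `resume()` is whatever was captured at interrupt time, so if the thread was running with `EFLAGS_IF` set, it resumes with interrupts enabled.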

That's just one way in which it can be made to work :) Obviously, the exact mechanisms are ABI/architecture dependent.
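For the per-task kernel stack layout that the question describes, the behaviour of the stacks can be imitated in user space. The following is a rough sketch, assuming Linux with glibc's `ucontext` API; the task functions, `fake_timer_interrupt()` and `run_scenario()` are hypothetical names, and nothing here models real interrupt masking. It only suggests that B's pending interrupt return is not skipped: it stays on B's stack and runs when B is switched back in, while A continues along its ordinary function-return path.

```c
#include <assert.h>
#include <string.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t ctx_main, ctx_a, ctx_b;
static char stack_a[STACK_SIZE], stack_b[STACK_SIZE];
static char trace[256];

/* Append an event to the trace, never overflowing the buffer. */
static void log_event(const char *s)
{
    strncat(trace, s, sizeof(trace) - strlen(trace) - 1);
}

/* Task A: blocks voluntarily, like mutex_lock() -> schedule(). */
static void task_a(void)
{
    log_event("A:block ");
    swapcontext(&ctx_a, &ctx_b);   /* A is parked inside its own switch */
    log_event("A:resumed ");       /* ordinary return path, no "IRET" here */
    swapcontext(&ctx_a, &ctx_b);   /* A blocks again, B gets the CPU back */
}

/* Stand-in for the interrupt exit path that decides to preempt B. */
static void fake_timer_interrupt(void)
{
    log_event("B:irq ");
    swapcontext(&ctx_b, &ctx_a);   /* switch away before the "IRET" */
    log_event("B:iret ");          /* runs only once B is scheduled again */
}

static void task_b(void)
{
    log_event("B:run ");
    fake_timer_interrupt();
    log_event("B:done ");          /* returning follows uc_link to ctx_main */
}

/* Prepare a context that starts fn on its own stack. */
static void make_task(ucontext_t *ctx, char *stack, void (*fn)(void))
{
    int rc = getcontext(ctx);
    assert(rc == 0);
    (void)rc;
    ctx->uc_stack.ss_sp = stack;
    ctx->uc_stack.ss_size = STACK_SIZE;
    ctx->uc_link = &ctx_main;
    makecontext(ctx, fn, 0);
}

/* Runs the scenario and returns the recorded order of events. */
const char *run_scenario(void)
{
    trace[0] = '\0';
    make_task(&ctx_a, stack_a, task_a);
    make_task(&ctx_b, stack_b, task_b);
    swapcontext(&ctx_main, &ctx_a);
    return trace;
}
```

Calling `run_scenario()` yields the trace `A:block B:run B:irq A:resumed B:iret B:done `, i.e. the simulated interrupt return happens on B's stack after A has already resumed.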

1
votes

I don't know about Linux, but in many operating systems the context switch is usually performed by a dispatcher, not an interrupt handler. If an interrupt doesn't result in a pending context switch, it just returns. If an interrupt-triggered context switch is needed, the current state is saved and the interrupt exits via the dispatcher (the dispatcher does the IRET). This gets more complicated if nested interrupts are allowed, since the initial interrupt is the one that goes to the dispatcher, regardless of which nested interrupt handler(s) triggered a context switch condition. An interrupt needs to check the saved state to see if it's a nested interrupt. If it is not, it can disable interrupts to prevent nested interrupts from occurring while it checks for a pending context switch and, if one is pending, exits via the dispatcher to perform it. If the interrupt is a nested interrupt, it only has to set a context switch flag if needed, and rely on the initial interrupt to do the check and context switch.

Usually, there's no need for an interrupt to save a thread's state in a kernel TCB unless a context switch is going to occur.

The dispatcher also handles the cases where context switches are triggered by non-interrupt conditions, such as mutexes, semaphores, and so on.
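The nested-interrupt rule can be sketched as follows. This is a minimal model, not any particular OS's code: `struct cpu_state`, `irq_enter()`, `request_switch()` and `irq_exit()` are hypothetical names, and a nesting counter is assumed in place of inspecting the saved state to detect a nested interrupt.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU bookkeeping for the sketch. */
struct cpu_state {
    int  irq_depth;        /* 0 means not inside any interrupt handler */
    bool switch_pending;   /* set by any handler that wants a context switch */
    int  dispatch_count;   /* how often the dispatcher ran (demo only) */
};

static void irq_enter(struct cpu_state *c)
{
    c->irq_depth++;
}

/* Any handler, nested or not, only sets the flag. */
static void request_switch(struct cpu_state *c)
{
    c->switch_pending = true;
}

/* Returns true if this exit went through the dispatcher. */
static bool irq_exit(struct cpu_state *c)
{
    assert(c->irq_depth > 0);
    c->irq_depth--;
    if (c->irq_depth > 0)
        return false;      /* nested: leave the flag for the initial interrupt */
    if (!c->switch_pending)
        return false;      /* nothing pending: plain interrupt return */
    c->switch_pending = false;
    c->dispatch_count++;   /* stand-in for: save state, pick next thread, IRET */
    return true;
}
```

For example, if a nested handler calls `request_switch()`, its own `irq_exit()` returns `false` and leaves the flag set; only the outermost `irq_exit()` clears the flag and goes through the dispatcher.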