I presume the author is using "system call" as shorthand for system overhead in general; although system call overhead itself is far from trivial.
User Threads
Inside a user program, you can make threads appear by allocating stacks (somehow), and by saving and restoring the user registers into a data structure.
A user-based thread transfer can be as simple as:
#include <setjmp.h>

typedef struct Thread {
    jmp_buf regs;   /* saved register context */
} Thread;

void ThreadSwitch(Thread *from, Thread *to) {
    if (setjmp(from->regs) == 0)   /* returns 0 when saving our context */
        longjmp(to->regs, 1);      /* resume 'to' where it last saved   */
}
where setjmp just stores some CPU registers to an array, and longjmp loads the same registers from an array. This hides a great deal of complexity, such as how I came to have separate stacks in the first place and how they are integrated into the language runtime. The point is, this is a very fast operation.
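To make one hidden piece concrete, here is a minimal sketch of where a separate stack might come from, using the POSIX ucontext API rather than raw setjmp/longjmp (building a new stack portably with setjmp alone is fragile). The 64 KiB stack size and the names main_ctx, thread_ctx, and thread_body are arbitrary choices for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, thread_ctx;

static void thread_body(void) {
    puts("running on the new stack");
    swapcontext(&thread_ctx, &main_ctx);    /* save ours, resume main */
}

int main(void) {
    getcontext(&thread_ctx);                         /* initialise the context */
    thread_ctx.uc_stack.ss_sp = malloc(64 * 1024);   /* the freshly allocated stack */
    thread_ctx.uc_stack.ss_size = 64 * 1024;
    thread_ctx.uc_link = &main_ctx;                  /* resume main if thread_body returns */
    makecontext(&thread_ctx, thread_body, 0);

    swapcontext(&main_ctx, &thread_ctx);             /* the user-level "thread switch" */
    puts("back on the original stack");
    return 0;
}

Here swapcontext() plays the role of ThreadSwitch() above: it saves one register set and loads another, entirely in user space.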
Hardware
When you invoke a system call (trap), the processor changes execution modes before proceeding into the kernel. It is tempting to think of the trap as just the few operations outlined in the CPU's technical reference manual; but on all but the most basic processors it is much more than that. As the CPU runs, it builds up an internal context for the current execution, which may include speculative values for registers and memory, and a sort of sparse associative array for branch prediction.
When a CPU switches modes, as occurs in a trap, some of this context must be discarded [lost opportunity cost], and some must be committed [synchronization cost].
An Arm Cortex-A73 can sustain about two instructions per cycle once all of this context has been constructed. Lacking that context, throughput can drop by more than a factor of 16.
Additionally, a trap instruction is a form of memory-ordering barrier: before the first instruction of the trap handler executes, all pending memory writes must become visible, which typically means being drained to the L1 cache. Store buffers on conventional CPUs range from a handful of entries to hundreds, so a trap can incur the latency of hundreds of L1 writes (with possible evictions down through the Ln caches).
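If you want to see the trap cost for yourself, a rough microbenchmark is easy to write. This is a sketch assuming Linux and glibc; syscall(SYS_getpid) forces a genuine trap on each iteration (a plain getpid() may be cached by libc), and the numbers vary widely with the CPU and with speculation mitigations:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

enum { N = 1000000 };

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void) {
    long long t0 = now_ns();
    for (int i = 0; i < N; i++)
        syscall(SYS_getpid);    /* one trap per iteration */
    long long t1 = now_ns();
    printf("%.1f ns per trap\n", (double)(t1 - t0) / N);
    return 0;
}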
Kernels
The kernel entry for the ThreadSwitchSysCall is unlikely to look like the above, although the first bit is logically similar to setjmp(). Each Thread most likely sits on a list indicating its disposition { Ready, Waiting, ... } and has associated scheduling parameters and affinities; affinity refers to which CPU the thread prefers, or is restricted, to run on. So the destination thread may need to be removed from one list (ensuring that the list is protected from other CPUs), then inserted into another list (again, consistently), possibly invoking a mechanism to alert another CPU that there is work for it, before determining which Thread should run on this CPU. Finally, with the chosen Thread, the kernel will execute the equivalent of the longjmp() to resume operation in that thread's context.
That is pretty much the minimum you might expect in an RTOS or microkernel. A heavyweight system such as Linux/Windows/macOS will have much more housekeeping to perform on such a switch.
TL;DR
So, when the textbook says "expensive because of system calls", it is just sparing you all of the above detail. It also isn't wrong. If you look at the concurrent programming model in modern systems languages like golang, it treats kernel threads as virtual CPUs and maintains its own threads (goroutines) as user threads multiplexed onto those virtual CPUs. That isn't an accident, and they didn't read the wrong textbook.