I presume the author is using "system call" as shorthand for system overhead in general; although system call overhead itself is far from trivial.
User Threads
Inside a user program, you can make threads appear by allocating stacks (somehow), and by saving and restoring the user registers into a data structure.
A user-based thread transfer can be as simple as:
#include <setjmp.h>

typedef struct Thread {
    jmp_buf regs;   /* saved register context */
} Thread;

void ThreadSwitch(Thread *from, Thread *to) {
    if (setjmp(from->regs) == 0)   /* returns 0 when saving our context */
        longjmp(to->regs, 1);      /* resume 'to' where it last saved   */
}
where setjmp just stores some CPU registers to an array, and longjmp loads the same registers from an array. This hides a great deal of complexity, such as how I came to have separate stacks in the first place and how they are integrated into the language runtime. The point is, this is a very fast operation.
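To make one hidden piece concrete, here is a minimal sketch of where a separate stack might come from, using the POSIX ucontext API rather than raw setjmp/longjmp (building a new stack portably with setjmp alone is fragile). The 64 KiB stack size and the names main_ctx, thread_ctx, and thread_body are arbitrary choices for illustration:

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, thread_ctx;

static void thread_body(void) {
    puts("running on the new stack");
    swapcontext(&thread_ctx, &main_ctx);    /* save ours, resume main */
}

int main(void) {
    getcontext(&thread_ctx);                         /* initialise the context */
    thread_ctx.uc_stack.ss_sp = malloc(64 * 1024);   /* the freshly allocated stack */
    thread_ctx.uc_stack.ss_size = 64 * 1024;
    thread_ctx.uc_link = &main_ctx;                  /* resume main if thread_body returns */
    makecontext(&thread_ctx, thread_body, 0);

    swapcontext(&main_ctx, &thread_ctx);             /* the user-level "thread switch" */
    puts("back on the original stack");
    return 0;
}

Here swapcontext() plays the role of ThreadSwitch() above: it saves one register set and loads another, entirely in user space.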
Hardware
When you invoke a system call (trap), the processor changes execution modes before proceeding into the kernel. It is tempting to think of the trap as just the few operations outlined in the CPU's technical reference manual; but on all but the most basic processors it is much more than that. As the CPU runs, it builds up an internal context for the current execution, which may include speculative values for registers and memory, and a sort of sparse associative array for branch prediction.
When a CPU switches modes, as occurs in a trap, some of this context must be discarded [lost opportunity cost], and some must be committed [synchronization cost].
An Arm Cortex-A73 can sustain about two instructions per cycle once all of this context has been constructed. Lacking that context, throughput can drop by more than a factor of 16.
Additionally, a trap instruction is a form of memory-ordering barrier: before the first instruction of the trap handler executes, all pending memory writes must become visible, which typically means being drained to the L1 cache. Store buffers on conventional CPUs range from a handful of entries to hundreds, so a trap can incur the latency of hundreds of L1 writes (with possible evictions down through the Ln caches).
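If you want to see the trap cost for yourself, a rough microbenchmark is easy to write. This is a sketch assuming Linux and glibc; syscall(SYS_getpid) forces a genuine trap on each iteration (a plain getpid() may be cached by libc), and the numbers vary widely with the CPU and with speculation mitigations:

#define _GNU_SOURCE
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

enum { N = 1000000 };

static long long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void) {
    long long t0 = now_ns();
    for (int i = 0; i < N; i++)
        syscall(SYS_getpid);    /* one trap per iteration */
    long long t1 = now_ns();
    printf("%.1f ns per trap\n", (double)(t1 - t0) / N);
    return 0;
}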
Kernels
The kernel entry for the ThreadSwitchSysCall is unlikely to look like the above, although the first bit is logically similar to setjmp(). Each Thread most likely sits on a list indicating its disposition { Ready, Waiting, ... } and has associated scheduling parameters and affinities; affinity refers to which CPU the thread prefers, or is restricted, to run on. So the destination thread may need to be removed from one list (ensuring that the list is protected from other CPUs), then inserted into another list (again, consistently), possibly invoking a mechanism to alert another CPU that there is work for it, before determining which Thread should run on this CPU. Finally, with the chosen Thread, the kernel will execute the equivalent of the longjmp() to resume operation in that thread's context.
That is pretty much the minimum you might expect in an RTOS or microkernel. A heavyweight system such as Linux/Windows/macOS will have much more housekeeping to perform on such a switch.
TL;DR
So, when the textbook says "expensive because of system calls", it is just sparing you all of the above detail. It also isn't wrong. If you look at the concurrent programming model in modern systems languages like golang, it treats kernel threads as virtual CPUs and maintains its own threads (goroutines) as user threads multiplexed onto those virtual CPUs. That isn't an accident, and they didn't read the wrong textbook.