Threads implemented as a library package in user space perform significantly better. Why?
They're not.
The fact is that most task switches are caused by threads blocking (having to wait for IO from disk or network, or from user, or for time to pass, or for some kind of semaphore/mutex shared with a different process, or some kind of pipe/message/packet from a different process) or caused by threads unblocking (because whatever they were waiting for happened); and most reasons to block and unblock involve the kernel in some way (e.g. device drivers, networking stack, ...); so doing task switches in kernel when you're already in the kernel is faster (because it avoids the overhead of switching to user-space and back for no sane reason).
Where user-space task switching "works" is when kernel isn't involved at all. This mostly only happens when someone failed to do threads properly (e.g. they've got thousands of threads and coarse-grained locking and are constantly switching between threads due to lock contention, instead of something sensible like a "worker thread pool"). It also only works when all threads are the same priority - you don't want a situation where very important threads belonging to one process don't get CPU time because very unimportant threads belonging to a different process are hogging the CPU (but that's exactly what happens with user-space threading because one process has no idea about threads belonging to a different process).
Mostly; user-space threading is a silly broken mess. It's not faster or "significantly better"; it's worse.
When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine happens to have an instruction to store all the registers and another one to load them all, the entire thread switch can be done in just a handful of instructions. Doing thread switching like this is at least an order of magnitude-maybe more-faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
This is talking about a situation where the CPU itself does the actual task switch (and either the kernel or a user-space library tells the CPU when to do a task switch to what). This has some relatively interesting history behind it...
In the 1980s Intel designed a CPU ("iAPX" - see https://en.wikipedia.org/wiki/Intel_iAPX_432 ) for "secure object oriented programming"; where each object has its own isolated memory segments and its own privilege level, and can transfer control directly to other objects. The general idea being that you'd have a single-tasking system consisting of global objects using cooperating flow control. This failed for multiple reasons, partly because all the protection checks ruined performance, and partly because the majority of software at the time was designed for "multi-process preemptive time sharing, with procedural programming".
When Intel designed protected mode (80286, 80386) they still had hopes for "single-tasking system consisting of global objects using cooperating flow control". They included hardware task/object switching, local descriptor table (so each task/object can have its own isolated segments), call gates (so tasks/objects can transfer control to each other directly), and modified a few control flow instructions (call far
and jmp far
) to support the new control flow. Of course this failed for the same reason iAPX failed; and (as far as I know) nobody has ever used these things for the "global objects using cooperative flow control" they were originally designed for. Some people (e.g. very early Linux) did try to use the hardware task switching for more traditional "multi-process preemptive time sharing, with procedural programming" systems; but found that it was slow because the hardware task switch did too many protection checks that could be avoided by software task switching and saved/reloaded too much state that could be avoided by a software task switching;p and didn't do any of the other stuff needed for a task switch (e.g. keeping statistics of CPU time used, saving/restoring debug registers, etc).
Now.. Andrew S. Tanenbaum is a micro-kernel advocate. His ideal system consists of isolated pieces in user-space (processes, services, drivers, ...) communicating via. synchronous messaging. In practice (ignoring superficial differences in terminology) this "isolated pieces in user-space communicating via. synchronous messaging" is almost entirely identical to Intel's twice failed "global objects using cooperative flow control".
Mostly; in theory (if you ignore all the practical problems, like CPU not saving all of the state, and wanting to do extra work on task switches like tracking statistics), for a specific type of OS that Andrew S. Tanenbaum prefers (micro-kernel with synchronous message passing, without any thread priorities), it's plausible that the paragraph quoted above is more than just wishful thinking.