What is the overhead of a context-switch?

Question

Originally I believed the overhead of a context switch was the TLB being flushed. However, I just saw this on Wikipedia:

http://en.wikipedia.org/wiki/Translation_lookaside_buffer

In 2008, both Intel (Nehalem)[18] and AMD (SVM)[19] have introduced tags as part of the TLB entry and dedicated hardware that checks the tag during lookup. Even though these are not fully exploited, it is envisioned that in the future, these tags will identify the address space to which every TLB entry belongs. Thus a context switch will not result in the flushing of the TLB – but just changing the tag of the current address space to the tag of the address space of the new task.

Does the above confirm that on newer Intel CPUs the TLB doesn't get flushed on context switches?

Does this mean there is no real overhead now in a context-switch?

(I am trying to understand the performance penalty of a context-switch)

TLB effects are only one part of the equation of the overhead of context switching. The overhead can never be completely eliminated, but changes in architecture like the above quote can help mitigate the overhead. There is no one answer to what that overhead is, because it depends highly on the hardware you have, the exact version of the operating system you have, the configured options in the kernel, the compiler and optimization levels used to build the kernel, and quite a few other things... — twalberg
@twalberg would you be able to give some very high-level examples regarding operating system/kernel overhead? — user997112
Your best bet is to probably pick the OS you're interested in (e.g. Linux), and look at the source code for the bits involved in context switching, including at least 1) the scheduling decision (what runs next?), 2) what adjustments to VM, TLB and other cache structures need to be made to switch, 3) what data needs to be saved / loaded (registers, floating point state, etc.), 4) does any of the above need to be broadcast to other CPUs (e.g. TLB shootdowns, etc.)... It's not exactly a simple topic... — twalberg
Also, after a context switch, the new process will quite likely run with a very cold processor cache. That alone can easily dwarf the cost of a cold TLB. — cmaster - reinstate monica
Seeing how the TLB is tiny (16 on Core2, 64 on Core i7), this likely isn't much of an improvement anyway. If another process only touches a few dozen kilobytes of memory during its time quantum, your TLB is completely gone either way, tagged or not. — Damon

osgx · Accepted Answer · 2014-03-15T08:18:48

As Wikipedia puts it in its Context switch article, "context switch is the process of storing and restoring the state (context) of a process so that execution can be resumed from the same point at a later time.". I'll assume a context switch between two processes of the same OS, not the user/kernel mode transition (syscall), which is much faster and needs no TLB flush.

So the OS kernel needs a lot of time to save the execution state of the currently running process to memory (all, really all, registers, plus many special control structures), and then to load the execution state of the other process from memory. A TLB flush, if needed, adds some time to the switch, but it is only a small part of the total overhead.

If you want to measure context switch latency, there is the lmbench benchmark tool http://www.bitmover.com/lmbench/ with its lat_ctx test http://www.bitmover.com/lmbench/lat_ctx.8.html

I can't find results for Nehalem (is there lmbench in the Phoronix suite?), but for Core 2 and a modern Linux a context switch may cost 5-7 microseconds.

There are also results from a lower-quality test, http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html, with 1-3 microseconds per context switch. I can't work out the exact effect of not flushing the TLB from its results.
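To make the measurement idea concrete, here is a minimal sketch of a pipe ping-pong between two processes, which is roughly the approach that pipe-based tests such as lat_ctx build on. It is an illustration only: the function name `pingpong_round_trip_us` and the round count are made up for this example, and because it is written in Python the result includes interpreter and pipe system call overhead, so treat it as a loose upper bound rather than a clean context switch cost. It assumes a Unix-like OS with `fork`.

```python
import os
import time


def pingpong_round_trip_us(rounds=10000):
    """Average round-trip time in microseconds for a one-byte ping-pong
    over two pipes between a parent and a forked child.

    Each round trip blocks both processes once, so when they share a CPU
    it contains at least two process switches plus pipe overhead.
    """
    p2c_r, p2c_w = os.pipe()  # parent -> child
    c2p_r, c2p_w = os.pipe()  # child -> parent

    pid = os.fork()
    if pid == 0:
        # Child: echo every byte back, then exit without running the
        # parent's cleanup code or any code after this function.
        try:
            os.close(p2c_w)
            os.close(c2p_r)
            for _ in range(rounds):
                os.read(p2c_r, 1)
                os.write(c2p_w, b"x")
        finally:
            os._exit(0)

    # Parent: close the ends it does not use, then time the ping-pong.
    os.close(p2c_r)
    os.close(c2p_w)
    start = time.perf_counter()
    for _ in range(rounds):
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
    elapsed = time.perf_counter() - start

    os.waitpid(pid, 0)
    os.close(p2c_w)
    os.close(c2p_r)
    return elapsed / rounds * 1e6


if __name__ == "__main__":
    us = pingpong_round_trip_us()
    print(f"average round trip: {us:.2f} us (two switches plus overhead)")
```

If the two processes end up on different cores, the number reflects cross-core wakeups rather than switches on one CPU; on Linux, running the script under `taskset -c 0` keeps both on the same core.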

UPDATE - Your question should be about virtualization, not about process context switches.

David Kanter's RWT article "Inside Nehalem: Intel’s Future Processor and System. TLBs, Page Tables and Synchronization" (April 2, 2008) says that Nehalem added a VPID to the TLB to make virtual machine/host switches (vmentry/vmexit) faster:

Nehalem’s TLB entries have also changed subtly by introducing a “Virtual Processor ID” or VPID. Every TLB entry caches a virtual to physical address translation ... that translation is specific to a given process and virtual machine. Intel’s older CPUs would flush the TLBs whenever the processor switched between the virtualized guest and the host instance, to ensure that processes only accessed memory they were allowed to touch. The VPID tracks which VM a given translation entry in the TLB is associated with, so that when a VM exit and re-entry occurs, the TLBs do not have to be flushed for safety. .... The VPID is helpful for virtualization performance by lowering the overhead of VM transitions; Intel estimates that the latency of a round trip VM transition in Nehalem is 40% compared to Merom (i.e. the 65nm Core 2) and about a third lower than the 45nm Penryn.

Also note that in the fragment you cited in the question, the "[18]" link points to "G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig. Intel Virtualization Technology: Hardware Support for Efficient Processor Virtualization. Intel Technology Journal, 10(3).", so this is a feature for efficient virtualization (fast guest-host switches).
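The tagging idea itself can be illustrated with a toy model. The sketch below is not how real hardware works: `ToyTLB`, its FIFO replacement, and the page counts are all assumptions made for the example (real TLBs are set-associative and multi-level). It only shows why a tag lets entries survive a switch, and why that stops helping once each address space touches as many pages as the TLB holds.

```python
class ToyTLB:
    """Toy fully-associative TLB with FIFO replacement (hypothetical model)."""

    def __init__(self, capacity=64, tagged=True):
        self.capacity = capacity
        self.tagged = tagged
        self.entries = {}      # (tag, virtual page) -> None, in insertion order
        self.current_tag = 0
        self.misses = 0

    def switch_to(self, tag):
        # Untagged: entries cannot be told apart, so everything is flushed.
        # Tagged: only the current tag changes; old entries stay cached.
        if not self.tagged:
            self.entries.clear()
        self.current_tag = tag

    def access(self, vpage):
        key = (self.current_tag if self.tagged else 0, vpage)
        if key in self.entries:
            return             # hit
        self.misses += 1
        self.entries[key] = None
        if len(self.entries) > self.capacity:
            # Evict the oldest entry (FIFO).
            self.entries.pop(next(iter(self.entries)))


def count_misses(tagged, pages_per_space, rounds=3, capacity=64):
    """Alternate between two address spaces and count TLB misses."""
    tlb = ToyTLB(capacity, tagged)
    for _ in range(rounds):
        for tag in (1, 2):
            tlb.switch_to(tag)
            for vpage in range(pages_per_space):
                tlb.access(vpage)
    return tlb.misses


if __name__ == "__main__":
    for pages in (8, 64):
        print(pages, "pages per space:",
              "tagged misses =", count_misses(True, pages),
              "untagged misses =", count_misses(False, pages))
```

With 8 pages per address space the tagged model only misses on the first pass, while the untagged one misses on every access after each switch. With 64 pages per address space and 64 entries, both models miss on every access, because each address space evicts the other's entries before they can be reused.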
