32 votes

Originally I believed the main overhead of a context switch was the TLB being flushed. However, I just saw on Wikipedia:

http://en.wikipedia.org/wiki/Translation_lookaside_buffer

In 2008, both Intel (Nehalem)[18] and AMD (SVM)[19] have introduced tags as part of the TLB entry and dedicated hardware that checks the tag during lookup. Even though these are not fully exploited, it is envisioned that in the future, these tags will identify the address space to which every TLB entry belongs. Thus a context switch will not result in the flushing of the TLB – but just changing the tag of the current address space to the tag of the address space of the new task.

Does the above confirm that, for newer Intel CPUs, the TLB doesn't get flushed on context switches?

Does this mean there is no real overhead now in a context-switch?

(I am trying to understand the performance penalty of a context-switch)

4
TLB effects are only one part of the equation of the overhead of context switching. The overhead can never be completely eliminated, but changes in architecture like the above quote can help mitigate the overhead. There is no one answer to what that overhead is, because it depends highly on the hardware you have, the exact version of the operating system you have, the configured options in the kernel, the compiler and optimization levels used to build the kernel, and quite a few other things... – twalberg

@twalberg would you be able to give some very high-level examples regarding operating system/kernel overhead? – user997112

Your best bet is to probably pick the OS you're interested in (e.g. Linux), and look at the source code for the bits involved in context switching, including at least 1) the scheduling decision (what runs next?), 2) what adjustments to VM, TLB and other cache structures need to be made to switch, 3) what data needs to be saved / loaded (registers, floating point state, etc.), 4) does any of the above need to be broadcast to other CPUs (e.g. TLB shootdowns, etc.)... It's not exactly a simple topic... – twalberg

Also, after a context switch, the new process will quite likely run with a very cold processor cache. That alone can easily dwarf the cost of a cold TLB. – cmaster - reinstate monica

Seeing how the TLB is tiny (16 on Core2, 64 on Core i7), this likely isn't much of an improvement anyway. If another process only touches a few dozen kilobytes of memory during its time quantum, your TLB is completely gone either way, tagged or not. – Damon

4 Answers

22 votes

As Wikipedia puts it in its Context switch article, "a context switch is the process of storing and restoring the state (context) of a process so that execution can be resumed from the same point at a later time". I'll assume a context switch between two processes of the same OS, not the user/kernel mode transition (syscall), which is much faster and needs no TLB flush.

So, the OS kernel needs quite a lot of time to save the execution state (all, really all, registers, plus many special control structures) of the currently running process to memory, and then to load the execution state of the other process (reading it in from memory). A TLB flush, if needed, adds some time to the switch, but it is only a small part of the total overhead.

If you want to measure context-switch latency, there is the lmbench benchmark tool http://www.bitmover.com/lmbench/ with its LAT_CTX test http://www.bitmover.com/lmbench/lat_ctx.8.html

I can't find results for Nehalem (is lmbench part of the Phoronix test suite?), but for Core 2 running modern Linux, a context switch may cost 5-7 microseconds.

There are also results from a rougher test, http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html, showing 1-3 microseconds per context switch. I can't determine the exact effect of not flushing the TLB from his results.
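If you just want a rough number on your own machine, below is a minimal C# sketch of the same ping-pong idea (my illustration, not lmbench; the class name and iteration count are arbitrary). Two threads hand a token back and forth through a pair of AutoResetEvents while the process is pinned to one logical CPU, so every hand-off forces the scheduler to switch threads:

using System;
using System.Diagnostics;
using System.Threading;

class ContextSwitchPingPong
{
    static readonly AutoResetEvent Ping = new AutoResetEvent(false);
    static readonly AutoResetEvent Pong = new AutoResetEvent(false);
    const int Iterations = 100000;

    static void Main()
    {
        // Pin the whole process to one logical CPU; otherwise the two threads
        // may run in parallel on different cores without actually switching.
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)1;

        var responder = new Thread(() =>
        {
            for (int i = 0; i < Iterations; i++) { Ping.WaitOne(); Pong.Set(); }
        });
        responder.Start();

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++) { Ping.Set(); Pong.WaitOne(); }
        sw.Stop();
        responder.Join();

        // One round trip = two thread wake-ups, i.e. roughly two switches.
        double nsPerSwitch = sw.Elapsed.TotalMilliseconds * 1e6 / (2.0 * Iterations);
        Console.WriteLine("~" + nsPerSwitch.ToString("F0") + " ns per switch (upper bound)");
    }
}

Keep in mind this includes the overhead of the event objects themselves, so treat the result as an upper bound on the pure switch cost.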

UPDATE: The feature you quoted is really about virtualization, not about process context switches.

RWT says in its article about Nehalem ("Inside Nehalem: Intel's Future Processor and System. TLBs, Page Tables and Synchronization", April 2, 2008, by David Kanter) that Nehalem added a VPID to the TLB to make virtual machine/host switches (vmentry/vmexit) faster:

Nehalem’s TLB entries have also changed subtly by introducing a “Virtual Processor ID” or VPID. Every TLB entry caches a virtual to physical address translation ... that translation is specific to a given process and virtual machine. Intel’s older CPUs would flush the TLBs whenever the processor switched between the virtualized guest and the host instance, to ensure that processes only accessed memory they were allowed to touch. The VPID tracks which VM a given translation entry in the TLB is associated with, so that when a VM exit and re-entry occurs, the TLBs do not have to be flushed for safety. .... The VPID is helpful for virtualization performance by lowering the overhead of VM transitions; Intel estimates that the latency of a round trip VM transition in Nehalem is 40% compared to Merom (i.e. the 65nm Core 2) and about a third lower than the 45nm Penryn.

Also, you should know that in the fragment you cited in the question, the "[18]" link points to "G. Neiger, A. Santoni, F. Leung, D. Rodgers, and R. Uhlig. Intel Virtualization Technology: Hardware Support for Efficient Processor Virtualization. Intel Technology Journal, 10(3).", so this is a feature for efficient virtualization (fast guest/host switches).

19 votes

Let's break the cost of a task switch into "direct costs" (the cost of the task switch code itself) and "indirect costs" (the cost of TLB misses, etc).

Direct Costs

For direct costs, this is mostly the cost of saving the (architecturally visible to user-space) state of the previous task and then loading the state of the next task. This varies depending on the situation, mostly because it may or may not include FPU/MMX/SSE/AVX state, which can add up to several KiB of data (especially if AVX is involved - e.g. AVX2 is 512 bytes by itself (16 YMM registers × 32 bytes), and AVX-512 is over 2 KiB by itself (32 ZMM registers × 64 bytes, plus mask registers)).

Note that there is a "lazy state load" mechanism to avoid the cost of loading (some or all of) the FPU/MMX/SSE/AVX state, and to avoid the cost of saving that state if it wasn't loaded. This feature can be disabled for performance reasons (if almost all tasks use the state, then the cost of the "state is being used and needs to be loaded" trap/exception exceeds what you save by avoiding the work during the task switch) or for security reasons (e.g. because the code in Linux does "save if used" and not "save then clear if used", it leaves data belonging to one task in registers, where a different task can obtain it via speculative execution attacks).

There are also some other costs: updating statistics (e.g. "amount of CPU time used by previous task"), determining whether the new task uses the same virtual address space as the old task (e.g. a different thread in the same process), etc.

Indirect Costs

Indirect costs are essentially the loss of effectiveness of all the "cache like" things the CPU has - the caches themselves, the TLBs, higher-level paging-structure caches, all the branch prediction state (branch direction, branch target, return buffer), etc.

Indirect costs can be split into 3 causes. One is indirect costs that occur because the thing was completely flushed by the task switch. In the past, this was mostly limited to TLB misses caused because the TLBs were flushed during the task switch. Note that this can happen even when PCID is being used: the PCID field is 12 bits, so there's a limit of 4096 IDs (and when "Meltdown mitigation" is being used, the IDs are used in pairs - for each virtual address space, one ID is used for user-space and another for the kernel). This means that when more than 4096 (or 2048) virtual address spaces are in use, the kernel has to recycle previously used IDs and flush all TLB entries for the ID being re-purposed. However, now (with all the speculative execution security problems), the kernel might flush other things as well (e.g. branch prediction state) so that information can't leak from one task to another; I really don't know whether Linux does or doesn't support this for each of the "cache like" things (and I suspect they primarily try to prevent data leaking from kernel to user-space, and end up preventing data leaking from one task to another by accident).

Another cause of indirect costs is capacity limits. For example, if the L2 cache can only hold a maximum of 256 KiB of data and the previous task used more than 256 KiB of data, then the L2 cache will be full of data that is useless for the next task, and all of the data that the next task wants cached (and previously had cached) will have been evicted due to "least recently used". This applies to all of the "cache like" things (including TLBs and higher-level paging-structure caches, even when the PCID feature is being used).
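To get a feel for this capacity effect, here is a small C# sketch (my illustration; the 128 KiB / 8 MiB sizes are assumptions, pick sizes around your own L2) that times one access per cache line over a working set that fits in L2 versus one that doesn't - the same eviction mechanism is what the next task pays for after a switch:

using System;
using System.Diagnostics;

class CacheCapacityDemo
{
    static long Traverse(byte[] data, int passes)
    {
        long sum = 0;
        for (int p = 0; p < passes; p++)
            for (int i = 0; i < data.Length; i += 64) // one access per 64-byte cache line
                sum += data[i];
        return sum;
    }

    static void Time(string label, int sizeBytes)
    {
        var data = new byte[sizeBytes];
        Traverse(data, 1);                    // warm-up: fault the pages in
        int passes = (256 << 20) / sizeBytes; // touch ~256 MiB total in both runs
        var sw = Stopwatch.StartNew();
        long sum = Traverse(data, passes);
        sw.Stop();
        long accesses = (long)passes * (sizeBytes / 64);
        Console.WriteLine(label + ": " +
            (sw.Elapsed.TotalMilliseconds * 1e6 / accesses).ToString("F2") +
            " ns per line (checksum " + sum + ")"); // print sum so it isn't optimized away
    }

    static void Main()
    {
        Time("fits in L2 (128 KiB)", 128 << 10);
        Time("exceeds L2 (8 MiB)  ", 8 << 20);
    }
}

The hardware prefetcher hides part of the gap for this sequential pattern; a random access pattern (closer to what many real tasks do) shows a much larger difference.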

The other cause of indirect costs is migrating a task to a different CPU. This depends on which CPUs are involved - e.g. if the task is migrated to a different logical CPU within the same core, then a lot of the "cache like" things may be shared by both CPUs and the migration costs may be relatively small; if the task is migrated to a CPU in a different physical package, then none of the "cache like" things may be shared by both CPUs and the migration costs may be relatively large.
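If you want to rule migration costs out of a measurement, one option is pinning the process. A tiny C# sketch (the mask value is machine-specific; whether CPUs 0 and 1 share a core depends on your topology):

using System;
using System.Diagnostics;

class PinExample
{
    static void Main()
    {
        // Bit mask of allowed logical CPUs: 0x3 = CPUs 0 and 1, which on many
        // machines are the two hyperthreads of the same physical core.
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)0x3;
        Console.WriteLine("Pinned to logical CPUs 0-1");
    }
}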

Note that the upper limit on the magnitude of indirect costs depends on what the task does. For example, if a task uses a large amount of data, then the indirect costs may be relatively expensive (lots of cache and TLB misses); if the task uses a tiny amount of data, then the indirect costs may be negligible (very few cache and TLB misses).

Unrelated

Note that the PCID feature has its own costs (not related to task switches themselves). Specifically, when page translations are modified on one CPU, they may need to be invalidated on other CPUs using something called "multi-CPU TLB shootdown", which is relatively expensive (it involves an IPI/inter-processor interrupt that disrupts other CPUs and costs "low hundreds of cycles" per CPU). Without PCID you can avoid some of these. For example, without PCID, for a single-threaded process running on one CPU, you know that no other CPU can be using the same virtual address space, and therefore you know that you don't need to do the "multi-CPU TLB shootdown"; and if a multi-threaded process is limited to a single NUMA domain, then only CPUs within that NUMA domain need to be involved in the "multi-CPU TLB shootdown". When PCID is being used, you can't rely on these tricks and have higher overhead because the "multi-CPU TLB shootdown" isn't avoided as often.

Of course there's also some cost associated with ID management (e.g. figuring out which ID is free to assign to a newly created task, revoking IDs when tasks are terminated, some kind of "least recently used" system for re-purposing IDs when there are more virtual address spaces than IDs, etc.).

Due to these costs, there are bound to be pathological cases where the cost of using PCID exceeds the "fewer TLB misses caused by task switches" benefit (where using PCID makes performance worse).

8 votes

If we count cache invalidation (which we usually should, and which is the largest contributor to context-switch costs in the real world), the performance penalty due to a context switch can be HUGE:

https://www.usenix.org/legacy/events/expcs07/papers/2-li.pdf (admittedly a bit outdated, but the best I was able to find) puts it in the range of 100K-1M CPU cycles. Theoretically, in the worst possible case - a multi-socket server box with a 32 MiB per-socket L3 cache consisting of 64-byte cache lines, completely random access, and typical access times of 40 cycles for L3 / 100 cycles for main RAM - the penalty can reach as much as 30M+ CPU cycles(!).
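For reference, the arithmetic behind that worst case (assuming every line of the L3 has to be refetched from RAM instead of being served from a warm L3):

32 MiB / 64 bytes per line = 524,288 cache lines
524,288 lines × (100 - 40) extra cycles per line ≈ 31.5M CPU cycles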

From personal experience, I'd say it is usually in the range of tens of K cycles, but depending on specifics, it can differ by an order of magnitude.

-1 votes

Note: as Brendan has pointed out in his comment, the objective of this answer is to address a more specific question: what is the overall impact of context switches on Windows server/desktop performance, including operating system overheads, which differ between Windows, Linux, Solaris, etc.?

The best way to find out is, of course, to benchmark it. The issue here is that the relationship between the number of context switches per second and CPU time is not linear - it grows much faster, roughly like an O(n²) cost in my measurements. This means there is a maximum rate that simply cannot be exceeded.

The following benchmark code uses a few thread-unsafe variables, etc.; ignore this, as that is not the point.

The actual work done per thread is minimal. Each thread sleeps for SleepTime (1 ms) per loop iteration, so in theory each thread should generate about 1,000 context switches per second.

  • Dump the following code into a .NET console app and see the results in perfmon.
  • Add two counters to perfmon: Processor -> % Processor Time and System -> Context Switches/sec. On an 8-core machine, 128 threads generate about 0.1% CPU overhead from the work done by the threads.

It would stand to reason that 2,560 threads (20 × 128) should generate about 2% CPU, but instead CPU goes to 100% at 2,300 threads (on my Core i7-4790K desktop machine: 4 cores + 4 hyperthreaded cores, i.e. 8 logical CPUs). Note that 2,048 threads × ~1,000 sleeps per second matches the ~2 million context switches per second measured below.

  • 2048 threads - 2 million context switches per second: CPU at 40%
  • 2300 threads - 2.3 million context switches per second: CPU at 100%

[perfmon graph: % Processor Time and Context Switches/sec]

// Paste into the Program class of a .NET console app (requires "using System;").
// Keys: UpArrow/DownArrow = start/stop one thread; Insert/Delete = start/stop 64 threads;
// PageUp/PageDown = increase/decrease per-thread sleep time; Q/Space/Enter = quit.
static void Main(string[] args)
{
    ThreadTestClass ThreadClass;
    bool Wait;
    int Counter;
    Wait = true;
    Counter = 0;
    while (Wait)
    {
        if (Console.KeyAvailable)
        {
            ConsoleKey Key = Console.ReadKey().Key;
            switch (Key)
            {
                case ConsoleKey.UpArrow:
                    ThreadClass = new ThreadTestClass();
                    break;
                case ConsoleKey.DownArrow:
                    SignalExitThread();
                    break;
                case ConsoleKey.PageUp:
                    SleepTime += 1;
                    break;
                case ConsoleKey.PageDown:
                    SleepTime -= 1;
                    break;
                case ConsoleKey.Insert:
                    for (int I = 0; I < 64; I++)
                    {
                        ThreadClass = new ThreadTestClass();
                    }
                    break;
                case ConsoleKey.Delete:
                    for (int I = 0; I < 64; I++)
                    {
                        SignalExitThread();
                    }
                    break;
                case ConsoleKey.Q:
                case ConsoleKey.Spacebar:
                case ConsoleKey.Enter:
                    // Any of these keys ends the main loop.
                    Wait = false;
                    break;
            }
        }
        Counter += 1;
        if (Counter >= 10) // print a status line roughly once per second (10 x 100 ms)
        {
            Counter = 0;
            Console.WriteLine(string.Concat(@"Thread Count: ", NumThreadsActive.ToString(), @" - SleepTime: ", SleepTime.ToString(), @" - Counter: ", UnSafeCounter.ToString()));
        }
        System.Threading.Thread.Sleep(100);
    }
    IsActive = false;
}

public static object SyncRoot = new object();
public static bool IsActive = true;
public static int SleepTime = 1; // ms each worker thread sleeps per loop iteration
public static long UnSafeCounter = 0; // deliberately not synchronized (see note above)
private static int m_NumThreadsActive;
public static int NumThreadsActive
{
    get
    {
        lock(SyncRoot)
        {
            return m_NumThreadsActive;
        }
    }
}
private static void NumThreadsActive_Inc()
{
    lock (SyncRoot)
    {
        m_NumThreadsActive += 1;
    }
}
private static void NumThreadsActive_Dec()
{
    lock (SyncRoot)
    {
        m_NumThreadsActive -= 1;
    }
}
private static int ThreadsToExit = 0;
private static bool ThreadExitFlag = false;
public static void SignalExitThread()
{
    lock(SyncRoot)
    {
        ThreadsToExit += 1;
        ThreadExitFlag = (ThreadsToExit > 0);
    }
}

private static bool ExitThread()
{
    if (ThreadExitFlag)
    {
        lock (SyncRoot)
        {
            // Re-check under the lock so ThreadsToExit can never go negative.
            if (ThreadsToExit > 0)
            {
                ThreadsToExit -= 1;
                ThreadExitFlag = (ThreadsToExit > 0);
                return true;
            }
        }
    }
    return false;
}

public class ThreadTestClass
{
    public ThreadTestClass()
    {
        System.Threading.Thread RunThread;
        RunThread = new System.Threading.Thread(new System.Threading.ThreadStart(ThreadRunMethod));
        RunThread.Start();
    }

    public void ThreadRunMethod()
    {
        long Counter1;
        long Counter2;
        long Counter3;
        Counter1 = 0;
        NumThreadsActive_Inc();
        try
        {
            while (IsActive && (!ExitThread()))
            {
                // Minimal "work": bump the shared counter, sleep (yielding the
                // CPU, so each wake-up is a context switch), then do a little
                // meaningless arithmetic.
                UnSafeCounter += 1;
                System.Threading.Thread.Sleep(SleepTime);
                Counter1 += 1;
                Counter2 = UnSafeCounter;
                Counter3 = Counter1 + Counter2;
            }
        }
        finally
        {
            NumThreadsActive_Dec();
        }
    }
}