Context switches much slower in new linux kernels

Question

We are looking to upgrade the OS on our servers from Ubuntu 10.04 LTS to Ubuntu 12.04 LTS. Unfortunately, it seems that the latency to run a thread that has become runnable has significantly increased from the 2.6 kernel to the 3.2 kernel. In fact the latency numbers we are getting are hard to believe.

Let me be more specific about the test. We have a program that runs two threads. The first thread gets the current time (in ticks using RDTSC) and then signals a condition variable once a second. The second thread waits on the condition variable and wakes up when it is signaled. It then gets the current time (in ticks using RDTSC). The difference between the time in the second thread and the time in the first thread is computed and displayed on the console. After this the second thread waits on the condition variable once more. It will be signaled again by the first thread after about a second passes.

So, in a nutshell we get a thread to thread communication via condition variable latency measurement once a second as a result.

In kernel 2.6.32, this latency is somewhere on the order of 2.8-3.5 us, which is reasonable. In kernel 3.2.0, this latency has increased to somewhere on the order of 40-100 us. I have excluded any differences in hardware between the two hosts. They run on identical hardware (dual socket X5687 {Westmere-EP} processors running at 3.6 GHz with hyperthreading, speedstep and all C states turned off). The test app changes the affinity of the threads to run them on independent physical cores of the same socket (i.e., the first thread is run on Core 0 and the second thread is run on Core 1), so there is no bouncing of threads on cores or bouncing/communication between sockets.

The only difference between the two hosts is that one is running Ubuntu 10.04 LTS with kernel 2.6.32-28 (the fast context switch box) and the other is running the latest Ubuntu 12.04 LTS with kernel 3.2.0-23 (the slow context switch box). All BIOS settings and hardware are identical.

Have there been any changes in the kernel that could account for this ridiculous slow down in how long it takes for a thread to be scheduled to run?

Update: If you would like to run the test on your host and linux build, I have posted the code to pastebin for your perusal. Compile with:

g++ -O3 -o test_latency test_latency.cpp -lpthread

Run with (assuming you have at least a dual-core box):

./test_latency 0 1 # Thread 1 on Core 0 and Thread 2 on Core 1

Update 2: After much searching through kernel parameters, posts on kernel changes and personal research, I have figured out what the problem is and have posted the solution as an answer to this question.

just a guess, but maybe changing a parameter from /proc/sys/kernel/* may work? If you find something that works, put that configuration in /etc/sysctl.conf or a file in /etc/sysctl.d/ to make it persist across reboots. — Carlos Campderrós
I compared /proc/sys/kernel between the two hosts, but see no meaningful differences, especially in any scheduling related configuration items. — Michael Goldshteyn
I vaguely recall a rumour that RDTSC is not necessarily properly synchronised between cores, but I would expect that if this were an issue you'd be seeing time reversal. Have you tried fiddling the affinities to run both threads on the same core and seeing what happens? — David Given
On Intel cores this new, RDTSC works flawlessly across cores, especially cores on the same CPU (i.e., the same socket). Interestingly, if both threads are run on the same core, the latencies go down to 4-10 us on the newer kernel and approx. 3 us on the older kernel. — Michael Goldshteyn
Just a general comment - relying on TSCs to be synchronized is iffy at best, although in your specific case, since you're using two cores on one physical chip, it should actually work out. — twalberg

Michael Goldshteyn Michael Goldshteyn · Accepted Answer · 2012-08-24T22:10:54

The solution to the bad thread wake up performance problem in recent kernels has to do with the switch to the intel_idle cpuidle driver from acpi_idle, the driver used in older kernels. Sadly, the intel_idle driver ignores the user's BIOS configuration for the C-states and dances to its own tune. In other words, even if you completely disable all C states in your PC's (or server's) BIOS, this driver will still force them on during periods of brief inactivity, which are almost always happening unless an all core consuming synthetic benchmark (e.g., stress) is running. You can monitor C state transitions, along with other useful information related to processor frequencies, using the wonderful Google i7z tool on most compatible hardware.

To see which cpuidle driver is currently active in your setup, just cat the current_driver file in the cpuidle section of /sys/devices/system/cpu as follows:

cat /sys/devices/system/cpu/cpuidle/current_driver

If you want your modern Linux OS to have the lowest context switch latency possible, add the following kernel boot parameters to disable all of these power saving features:

On Ubuntu 12.04, you can do this by adding them to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub and then running update-grub. The boot parameters to add are:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll

Here are the gory details about what the three boot options do:

Setting intel_idle.max_cstate to zero will either revert your cpuidle driver to acpi_idle (at least per the documentation of the option), or disable it completely. On my box it is completely disabled (i.e., displaying the current_driver file in /sys/devices/system/cpu/cpuidle produces an output of none). In this case the second boot option, processor.max_cstate=0 is unnecessary. However, the documentation states that setting max_cstate to zero for the intel_idle driver should revert the OS to the acpi_idle driver. Therefore, I put in the second boot option just in case.

The processor.max_cstate option sets the maximum C state for the acpi_idle driver to zero, hopefully disabling it as well. I do not have a system that I can test this on, because intel_idle.max_cstate=0 completely knocks out the cpuidle driver on all of the hardware available to me. However, if your installation does revert you from intel_idle to acpi_idle with just the first boot option, please let me know if the second option, processor.max_cstate did what it was documented to do in the comments so that I can update this answer.

Finally, the last of the three parameters, idle=poll is a real power hog. It will disable C1/C1E, which will remove the final remaining bit of latency at the expense of a lot more power consumption, so use that one only when it's really necessary. For most this will be overkill, since the C1* latency is not all that large. Using my test application running on the hardware I described in the original question, the latency went from 9 us to 3 us. This is certainly a significant reduction for highly latency sensitive applications (e.g., financial trading, high precision telemetry/tracking, high freq. data acquisition, etc...), but may not be worth the incurred electrical power hit for the vast majority of desktop apps. The only way to know for sure is to profile your application's improvement in performance vs. the actual increase in power consumption/heat of your hardware and weigh the tradeoffs.

Update:

After additional testing with various idle=* parameters, I have discovered that setting idle to mwait if supported by your hardware is a much better idea. It seems that the use of the MWAIT/MONITOR instructions allows the CPU to enter C1E without any noticeable latency being added to the thread wake up time. With idle=mwait, you will get cooler CPU temperatures (as compared to idle=poll), less power use and still retain the excellent low latencies of a polling idle loop. Therefore, my updated recommended set of boot parameters for low CPU thread wake up latency based on these findings is:

intel_idle.max_cstate=0 processor.max_cstate=0 idle=mwait

The use of idle=mwait instead of idle=poll may also help with the initiation of Turbo Boost (by helping the CPU stay below its TDP [Thermal Design Power]) and hyperthreading (for which MWAIT is the ideal mechanism for not consuming an entire physical core while at the same time avoiding the higher C states). This has yet to be proven in testing, however, which I will continue to do.

Update 2:

The mwait idle option has been removed from newer 3.x kernels (thanks to user ck_ for the update). That leaves us with two options:

idle=halt - Should work as well as mwait, but test to be sure that this is the case with your hardware. The HLT instruction is almost equivalent to an MWAIT with state hint 0. The problem lies in the fact that an interrupt is required to get out of a HLT state, while a memory write (or interrupt) can be used to get out of the MWAIT state. Depending on what the Linux Kernel uses in its idle loop, this can make MWAIT potentially more efficient. So, as I said test/profile and see if it meets your latency needs...

and

idle=poll - The highest performance option, at the expense of power and heat.

Context switches much slower in new linux kernels

3 Answers