An atomic operation is, classically, an op code that does a Test-and-Set. Basically this tests a value in memory and, if it is zero (for example), increments it. Whilst this is going on the CPU won't let any other core access that location, and the Test-and-Set is guaranteed to complete uninterrupted. So when called, the end result is either that the value has been incremented and your program goes down one branch, or it hasn't and your program goes down a different branch.
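As a rough illustration (using C++11's std::atomic rather than any particular CPU's op code; the names here are mine, purely for the example):

```cpp
#include <atomic>
#include <cstdio>

std::atomic<int> flag{0};

void tryToClaim()
{
    int expected = 0;
    // Atomically: "if flag is 0, make it 1". No other core can get in
    // between the test and the set.
    if (flag.compare_exchange_strong(expected, 1))
        std::printf("we got it - take the 'winner' branch\n");
    else
        std::printf("someone else got there first - take the other branch\n");
}
```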
Not all CPUs have one of these - the 68000 family did, the PowerPC family did not (I think - corrections welcome; I know it was a right nuisance in PowerPC VME systems, whereas the 68000-based previous generation of machines could do a test-and-set on remote boards), and I'm pretty sure earlier x86s didn't either. As far as I know, all major modern CPUs do - it's very useful.
Effectively a Test-and-Set gives you a counting semaphore, and that is what they're used for. However, with only a little bit of chicanery in a library it can also be used as a mutex (which is a binary semaphore that can be given only by the thread that took it).
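To give a flavour of that chicanery, here's a bare-bones lock built on std::atomic_flag (which is essentially a portable test-and-set); a real library adds the bookkeeping that makes it a proper mutex (ownership, priority handling, back-off and so on), but the core is just this:

```cpp
#include <atomic>

class SpinLock
{
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
public:
    void take()
    {
        // test_and_set returns the previous value, so keep trying until we
        // were the one that flipped it from clear to set.
        while (locked.test_and_set(std::memory_order_acquire))
        {
            // spin - note the kernel never hears about any of this
        }
    }

    void give()
    {
        locked.clear(std::memory_order_release);
    }
};
```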
AFAIK semaphores and mutexes are implemented these days by making use of the Test-and-Set op codes available on the CPU. However, on platforms where there is no Test-and-Set op code, its behaviour has to be synthesised by the OS, probably involving an ISR, interrupt disabling, and so forth. The end result behaves the same, but it's considerably slower. Also, on these platforms an "atomic" has to be synthesised using mutexes to guard the value.
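Roughly speaking, the synthesised version ends up with the shape of this sketch (a compiler or OS would do it at a lower level, but the idea is the same):

```cpp
#include <mutex>

// A pretend "atomic" counter for a platform with no suitable op code:
// every operation has to be guarded by a mutex instead, which is much
// slower than a single uninterruptible instruction.
class SynthesisedAtomicInt
{
    std::mutex guard;
    int value = 0;
public:
    int fetchAdd(int amount)
    {
        std::lock_guard<std::mutex> lock(guard);
        int old = value;
        value += amount;
        return old;
    }
};
```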
So I suspect that talk of mutexes serialising at the kernel level is referring to systems where a mutex has been implemented by the kernel, and the atomic operations are supported by the CPU.
It's also worth remembering that calls to take / give mutexes involve the kernel making scheduling decisions, even if the OS then goes on to use a CPU test-and-set op code to implement the mutual exclusion part of the mutex. Whereas calling a test-and-set op code directly from within your program does not; the kernel has no idea it's even happened. So a mutex is a good way to ensure that high priority threads run first if there is contention, whereas a test-and-set op code likely is not (it will be a first come, first served thing). That's because the CPU has no concept of thread priority; that's an abstract concept dreamed up by the OS developers.
You can learn a lot about how this kind of thing is done by rooting around inside the source code of the Boost C++ library. Things like shared pointers depend on mutual exclusion, and Boost can implement that mutual exclusion in a number of different ways: using test-and-set style op-codes on platforms that have them, or using the POSIX mutex library calls, or, if you tell it there is only one thread in your program, not bothering at all.
It's worthwhile for Boost to implement its own mutual exclusion mechanisms using op-codes where it can; it doesn't need mutual exclusion to work inter-process (just inter-thread), whereas a full-on POSIX mutex can be inter-process - overkill for Boost's requirements.
With Boost you can override the default selection using a few #defines. So you can speed up a single-threaded program by getting it compiled without the mutual exclusion in shared pointers, which is occasionally genuinely useful. What I don't know is whether that's been lost in C++11 and onwards, now that the standard has absorbed smart pointers and made them its own.
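If memory serves, the knobs look something like this (check the Boost smart_ptr documentation for your version before relying on them):

```cpp
// From memory - define one of these before any Boost headers are included
// (e.g. on the compiler command line) to override the default choice:

#define BOOST_SP_DISABLE_THREADS   // single-threaded program: no locking at all
//#define BOOST_SP_USE_PTHREADS    // force use of POSIX mutexes
//#define BOOST_SP_USE_SPINLOCK    // force the atomic / spin-lock implementation

#include <boost/shared_ptr.hpp>
```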
EDIT
It's also worth taking a look at futexes, which is what Linux uses as the underpinnings for mutexes, semaphores, etc. The idea of a futex is to use atomic operations to implement the bulk of the functionality entirely in user space, resorting to system calls only when absolutely necessary. The result is that, so long as there's not too much contention, a high level thing like a mutex or semaphore is very much more efficient than in the bad old days when they always resulted in a system call. Futexes have been around in Linux since about 2003, so we've had the benefit of them for 15 years now. Basically there's no point worrying too much about the efficiency of mutexes vs atomic operations - they're not too far off being the same thing.
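A very rough, Linux-specific sketch of the fast-path / slow-path idea, loosely following the three-state design in Ulrich Drepper's "Futexes Are Tricky" (error handling and memory-ordering subtleties glossed over):

```cpp
#include <atomic>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

std::atomic<int> state{0};   // 0 = free, 1 = taken, 2 = taken with waiters

static long futex(void *addr, int op, int val)
{
    return syscall(SYS_futex, addr, op, val, nullptr, nullptr, 0);
}

void take()
{
    int c = 0;
    // Fast path: uncontended - one atomic op, the kernel never gets involved.
    if (state.compare_exchange_strong(c, 1))
        return;

    // Slow path: mark the lock contended and sleep in the kernel until the
    // holder wakes us, then try again.
    if (c != 2)
        c = state.exchange(2);
    while (c != 0)
    {
        futex(&state, FUTEX_WAIT, 2);
        c = state.exchange(2);
    }
}

void give()
{
    // If nobody was waiting (state was 1), this too is just one atomic op.
    if (state.exchange(0) == 2)
        futex(&state, FUTEX_WAKE, 1);
}
```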
What's likely more important is to aim for clean, tidy source code that's easy to read, and to use the library calls that help with that. Using, say, atomic operations over mutexes at the expense of simple source code is likely not worth it. Certainly on platforms like VxWorks, which don't really have the concept of kernel / user space in the first place and are engineered around lightning-fast context switch times, one can afford to be profligate with the use of mutexes and semaphores to achieve simplicity.
For example, using a mutex to control which thread has access to a particular network socket is a way of using the kernel and thread priorities to manage the priorities of the different types of message being sent through that socket. The source code is beautifully simple - threads merely take / give the mutex around using the socket, and that's all there is. No queue manager, no prioritisation decision making code, nothing. All of that is done by the OS scheduling threads in response to mutex takes / gives. On VxWorks this ended up being pretty efficient, benefited from the OS resolving priority inversions, and took very little time to develop. On Linux, especially one with the PREEMPT_RT patch set applied and running as real time priority threads, it's also not too bad (because that also resolves priority inversions, something that I gather Linus doesn't much care for). Whereas on an OS that doesn't have futexes underpinning mutexes and also has expensive context switch times (e.g. Windows), it would be inefficient.
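In outline the pattern is no more than this sketch (the socket setup and message format are made up for illustration):

```cpp
#include <mutex>
#include <cstddef>
#include <sys/socket.h>

std::mutex socketMutex;   // one mutex guarding one socket
int sock = -1;            // assume this gets connected somewhere else

// Every sender thread, whatever its priority, does only this:
void sendMessage(const void *msg, size_t len)
{
    std::lock_guard<std::mutex> lock(socketMutex);
    send(sock, msg, len, 0);
    // Which high-priority sender gets in first under contention is the
    // scheduler's problem, not this code's.
}
```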