6
votes

I have been using C++ for a long time, and now I'm starting to learn assembly and how processors work (not just for fun; I have to as part of a test program). While learning assembly, I started hearing some of the terms that come up when discussing multithreading, which I do a lot of in scientific computing. I'm struggling to get the full picture, and I'd appreciate some help widening it.

I learned that a bus, in its simplest form, is something like a multiplexer followed by a demultiplexer. Each end takes an address as input, in order to connect the two ends with some external component. The two ends can, based on the address, point to RAM, the graphics card, CPU registers, or anything else.

Now getting to my question: I keep hearing people argue about whether to use a mutex or an atomic for thread safety (I know there's no ultimate answer; my question is about the comparison, not which to pick). Here, for example, the claim was made that atomics are so bad that they will prevent a processor from doing a decent job, because of bus-locking.

Could someone please explain, in a little detail, what bus-locking is, and why atomics would differ from mutexes in this respect, when AFAIK a mutex needs at least two atomic operations (one to lock, one to unlock)?

3
Bus-locking happens for atomic read-modify-write operations only. A mutex needs to perform an RMW, too, in order to acquire the lock, so it doesn't prevent bus locking. Bus locking refers to the locking of the memory bus, so no other processor can access memory (perhaps only the location or cache line in question) while the bus is locked. In x86 it's effected by the LOCK prefix (which is implied for a memory XCHG). – Kerrek SB
(Other architectures may not be able to lock the bus and instead perform RMW operations in a loop.) – Kerrek SB
@kerrek I don't think that information is up to date. The bus (which doesn't even exist any more on modern Intel CPUs) is only used if cheaper cache locks, etc. aren't possible (writes straddling cache lines, etc.). It should be an exceedingly rare event. – Voo

3 Answers

6
votes

From Intel® 64 and IA-32 Architectures Software Developer’s Manual:

Beginning with the P6 family processors, when the LOCK prefix is prefixed to an instruction and the memory area being accessed is cached internally in the processor, the LOCK# signal is generally not asserted. Instead, only the processor’s cache is locked. Here, the processor’s cache coherency mechanism ensures that the operation is carried out atomically with regards to memory.

There are special non-temporal store instructions to bypass the cache. All other loads and stores normally go through the cache, unless the memory page is marked as non-cacheable (like GPU or PCIe device memory).

5
votes

"I learned that a bus, in its simplest form, is something like a multiplexer followed by a demultiplexer. Each of the ends"

Well, that's not correct. In its simplest form there's nothing to multiplex or demultiplex. It's just two things talking directly to each other. And in the not-so-simple case, a bus may have three or more devices connected. In that case, you start needing bus addresses, because you can no longer talk about "the other end".

Now if you've got multiple devices on a single bus, they generally can't all talk at the same time, so there must be some mechanism to prevent that. Yet for all devices to be able to share that bus, they must be able to take turns in who is talking to whom. Bus locking, as a broad term, means any deviation from that usual pattern, in which two devices reserve the bus for their mutual conversation.

In the particular context of the x86 memory bus, this means keeping the bus locked during a read-modify-write cycle (as Kerrek SB pointed out in the comments). This may sound like a simple bus with two devices (memory and CPU), but DMA and multi-core chips make it not that simple.

2
votes

Bus locks are required when more than one resource is needed to complete the access. Usually, locked operations that don't span a cache line and target cacheable memory don't require a bus lock: the core just acquires the line in exclusive state and can NACK other cores that try to access it.

Non-cacheable memory types do require bus locks, as does a misaligned locked operation that spans a cache line, and any other transaction that needs multiple resources.

If not all resources can be acquired, processes could deadlock. This happens when multiple processes each grab some of the resources, such that no one process holds everything it needs to make forward progress.