2
votes

On an Intel i3, i5, or i7 x86-64 CPU, for instance, is it true that CAS only guarantees atomicity for objects of at most 8 bytes?

On x86 CPUs, the LOCK prefix is added to CAS instructions such as CMPXCHG, which means the whole cache line is locked against reads from other CPUs. Does that mean std::atomic::compare_exchange_weak() will never fail spuriously?

If x86 CPUs use LOCK for CAS operations, what is the performance gain of lock-free programming over using a std::mutex on the shared resource?

Suppose I want to write a lock-free linked list, for example. I do an atomic load on the head node's pointer and compare it with std::atomic::compare_exchange_weak() to see whether any changes have been made. In this case, does the ABA problem apply on x86 CPUs?


1 Answer

4
votes

You've got several questions here :-)

  1. A CAS operation is always atomic, by definition. Having said that, it's up to your hardware what (if any) CAS operations are supported, including the maximum number of bytes that can be swapped for such an operation. Some CPUs (e.g. ARM) don't directly support CAS at all. x86-64 CPUs support 8-byte CAS, and all modern ones also support 16-byte CAS (commonly referred to as double-width CAS) via the CMPXCHG16B instruction.

  2. I'm not sure if CAS can fail spuriously on the CPUs you mention (though I know on some platforms it doesn't). I don't know enough about the cache architecture on those specific CPUs. However, the underlying hardware implementation is irrelevant when choosing between compare_exchange_weak and compare_exchange_strong: you should use the one that makes sense for what you're doing -- a weak exchange if you're just writing a typical small CAS loop (where the additional work on spurious failure is negligible), and a strong exchange otherwise. This also makes your code more portable and robust.

  3. You'll need to measure. It depends almost entirely on what your application is doing (and if your application really is bottlenecked by extremely high contention around shared resources, it could probably benefit from a redesign that requires less sharing in the first place). In general std::mutex is "good enough" (actually quite performant most of the time), but if you have very small amounts of work inside the lock under high contention, the locking overhead could indeed eclipse the amount of actual work, and a lock-free algorithm can increase throughput significantly.

  4. ABA absolutely applies on x86. It's a problem that arises due to the fundamental nature of CAS itself, regardless of the hardware implementation. I wrote a little about it here.

My advice is to be very careful when writing lock-free code. It's extremely difficult to test, and may work (even in production) for years before a bug becomes visible in some corner case (e.g. on slightly different hardware, or when used under different workloads, etc.). Don't start out writing a general-purpose data structure like a linked list, because properly handling insertions and deletions is a nightmare in the general case (at least, it is if your goal is to maintain high performance under contention, which it usually is, since that's why you ended up writing a lock-free data structure in the first place). Instead, figure out the exact minimum operations that your specific application logic requires, and implement only those. An add-only lock-free linked list is fairly trivial to write; incorporating the ability to remove either the head or tail is much, much more complex, thanks to ABA.