Guaranteed Atomic Operations on POD types naturally aligned on Intel

Question

I have a multi-threaded C++ application running on a 32-core Intel Xeon, compiled with GCC 4.8.2 with optimizations enabled.

I have multiple threads (say A, B, C) that update some POD variables, and another thread D that reads those variables every K seconds and sends them to a GUI. The threads are spread across multiple cores and sockets. The writes are protected by a spin-lock. Threads A, B, C are latency sensitive, and high performance is critical for them. Thread D is not latency sensitive.

Something like:

Thread A,B,C
...
// a,b,c are up to 64 bits (let's say double)
spin-lock
a = computeValue();
b = computeValue();
c = computeValue();
spin-unlock
....

Thread D
...
// a,b,c are up to 64 bits (let's say double)
currValueA = a;
currValueB = b;
currValueC = c;
sendToGui(currValueA ,currValueB ,currValueC );
....

I want to take advantage of Section 8.1.1 ("Guaranteed Atomic Operations") of the Intel Software Developer's Manual, Volume 3A, https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html, and avoid putting a lock around the reads made by thread D.

My understanding is that if a, b, c are naturally aligned (with a size no bigger than 64 bits), there is no risk that thread D reads a value of a, b, or c that was caught halfway through a write. In other words, each individual write and read is carried out atomically, so thread D reads either the old value or the new one.

Is my understanding correct?

I left it to the compiler (GCC 4.8.2) to take care of the alignment, i.e. I don't use any alignment specifiers or operators such as alignas or alignof.
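The alignment assumption can be checked instead of taken on trust. The following is only a sketch: SharedValues and isNaturallyAligned are hypothetical names introduced for illustration, and the explicit alignas(8) is an assumption about how the variables could be declared, not how they are declared in the code above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical holder for the shared values. alignas(8) makes the 8-byte
// alignment explicit instead of relying on the ABI default for double.
struct SharedValues {
    alignas(8) double a;
    alignas(8) double b;
    alignas(8) double c;
};

// Compile-time check that the struct itself is at least 8-byte aligned.
static_assert(alignof(SharedValues) >= 8,
              "SharedValues must be at least 8-byte aligned");

// Returns true when the object at p starts on a multiple of its own size,
// which is what "naturally aligned" means for the sizes discussed here.
inline bool isNaturallyAligned(const void* p, std::size_t size) {
    return reinterpret_cast<std::uintptr_t>(p) % size == 0;
}
```

A check like this only confirms where the object sits in memory. It says nothing about what the compiler is allowed to do with plain, non-atomic accesses to it.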

I am aware that the code is not portable. I would prefer not to use std::atomic to avoid any unnecessary overhead.

with std::atomic and specify target architecture, compiler might do the optimization by itself. So your code would be portable. let compiler job to compiler. — Jarod42
Have you found any "unnecessary overhead" caused by the use of std::atomic? — Pete Becker
Using std::atomic with memory_order_relaxed won't have any overhead at all for pure loads and pure stores (and will ensure correct alignment for 64-bit values even in 32-bit code, where alignof(int64_t)=4 on Linux), but it will interfere with auto-vectorization. (Avoid using any atomic RMW operations, of course). If you care about performance, consider using a newer compiler, like gcc7 or gcc8. There have been various improvements since 4.8. — Peter Cordes
Thanks for your comments. I compared the assemblies generated with an atomic store (memory_relaxed and target specified) and a normal assignment. They look a bit different. I see two more instructions: movabsq and movq. I can't say though if they have any impact on performance. I should also definitely try out the newer g++. Thanks. — rdil2503
If you want to program "to the metal" using the Intel spec, are you going to use inline assembly? — curiousguy

Pete Becker · Accepted Answer · 2019-02-13T17:06:23

Reading a value "taken halfway during the write" is only one aspect of atomicity.

Processors these days keep values in processor-specific caches, so on a multi-processor system, two different processors may well have different values for the a that they share. Marking a as atomic ensures that different processors see "the same" value.

In addition, the compiler and the processor often reorder calculations in order to make better use of processing facilities. That's all okay, so long as the result of those calculations isn't changed. (That's the "as if" rule in C++). But "isn't changed" refers to execution within a single thread. Optimizations that work in a single thread don't necessarily work when multiple threads are pounding on the same object. And you don't, in general, want your single-threaded code to be compiled by a paranoid compiler that doesn't do common optimizations because they might break multi-threaded code. Instead, marking an object as atomic says that the compiler should be very careful about what things it moves around, because that object's value can be changed behind the scenes by some other code.
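As a rough illustration of the optimization concern, consider a loop that polls a flag. The names stopRequested and pollUntilStopped are hypothetical and not from the question. With a plain bool, the compiler would be allowed to read the flag once and hoist the test out of the loop. Declaring it std::atomic obliges it to reload the flag on every iteration.

```cpp
#include <atomic>

// Hypothetical stop flag that some other thread may set at any time.
std::atomic<bool> stopRequested{false};

// Counts loop iterations until a stop is requested or maxPolls is reached.
// Each iteration performs a fresh atomic load of the flag, so a store made
// by another thread is eventually observed by this loop.
long pollUntilStopped(long maxPolls) {
    long polls = 0;
    while (polls < maxPolls &&
           !stopRequested.load(std::memory_order_relaxed)) {
        ++polls;
    }
    return polls;
}
```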

So you have a choice: hand-roll your code and hope that you get it right, or accept that the writer of the atomic library probably knows more about atomicity on your target system than you do, and will probably do a better job.
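For the second option, a minimal sketch of what the question's variables might look like with std::atomic is shown here. Stats, Snapshot, publish and sample are hypothetical names, and memory_order_relaxed is the ordering suggested in the comments on the question, not something this answer prescribes.

```cpp
#include <atomic>

// Hypothetical shared state; the members mirror a, b, c from the question.
struct Stats {
    std::atomic<double> a{0.0};
    std::atomic<double> b{0.0};
    std::atomic<double> c{0.0};
};

// Plain copy of the three values, as thread D would hand it to the GUI.
struct Snapshot {
    double a;
    double b;
    double c;
};

// Writer side (threads A, B, C): in the question this runs with the
// spin-lock held, which serializes the writers among themselves.
inline void publish(Stats& s, double a, double b, double c) {
    s.a.store(a, std::memory_order_relaxed);
    s.b.store(b, std::memory_order_relaxed);
    s.c.store(c, std::memory_order_relaxed);
}

// Reader side (thread D): takes no lock. Each load is individually untorn.
inline Snapshot sample(const Stats& s) {
    return Snapshot{s.a.load(std::memory_order_relaxed),
                    s.b.load(std::memory_order_relaxed),
                    s.c.load(std::memory_order_relaxed)};
}
```

Note that this only makes each value atomic on its own. Because thread D does not take the lock, the three values in one Snapshot may come from different updates. If they must be mutually consistent, the reader needs the lock or some other scheme.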
