
I have some doubts about the C++11/C11 memory model that I was wondering if anyone can clarify. These are questions about the model/abstract machine, not about any real architecture.


  1. Are acquire/release effects guaranteed to "cascade" from one thread to the next?

Here is a pseudocode example of what I mean (assume all variables start at 0):

[Thread 1]
store_relaxed(x, 1);
store_release(a, 1);

[Thread 2]
while (load_acquire(a) == 0);
store_release(b, 1);

[Thread 3]
while (load_acquire(b) == 0);
assert(load_relaxed(x) == 1);

Thread 3's acquire syncs with Thread 2's release, which comes after Thread 2's acquire which syncs with Thread 1's release. Therefore, Thread 3 is guaranteed to see the value that Thread 1 set to x, correct? Or do we need to use seq cst here in order to be guaranteed that the assert will not fire? I have a feeling acquire/release is enough, but I can't quite find any simple explanation that guarantees it. Most explanations of acquire/release mainly focus on the acquiring thread receiving all the stores made by the releasing thread. However in the example above, Thread 2 never touches variable x, and Thread 1/Thread 3 do not touch the same atomic variable. It's obvious that if Thread 2 were to load x, it would see 1, but is that state guaranteed to cascade over into other threads which subsequently do an acquire/release sync with Thread 2? Or does Thread 3 also need to do an acquire on variable a in order to receive Thread 1's write to x?

According to https://en.cppreference.com/w/cpp/atomic/memory_order:

All writes in the current thread are visible in other threads that acquire the same atomic variable

All writes in other threads that release the same atomic variable are visible in the current thread

Since Thread 1 and Thread 3 don't touch the same atomic variable, I'm not sure if acquire/release alone is enough for the above case. There's probably an answer hiding in the formal description, but I can't quite work it out.

EDIT: Didn't notice until after the fact, but there is an example at the link I posted ("The following example demonstrates transitive release-acquire ordering...") that is almost the same as my example, but it uses the same atomic variable across all three threads, which seems like it might be significant. I am specifically asking about the case where the variables are not the same.


  2. Am I right in believing that according to the standard, there must always be a pair of non-relaxed atomic operations, one in each thread, in order for any kind of memory ordering at all to be guaranteed?

Imagine there is a function "get_data" that allocates a buffer, writes some data to it, and returns a pointer to the buffer. And there is a function "use_data" that takes the pointer to the buffer and does something with the data. Thread 1 gets a buffer from get_data and passes it to Thread 2 using a relaxed atomic store to a global atomic pointer. Thread 2 does relaxed atomic loads in a loop until it gets the pointer, and then passes it off to use_data:

int* get_data() {...}
void use_data(int* buf) {...}
int* global_ptr = nullptr;

[Thread 1]
int* buf = get_data();
super_duper_memory_fence();
store_relaxed(global_ptr, buf);

[Thread 2]
int* buf = nullptr;
while ((buf = load_relaxed(global_ptr)) == nullptr);
use_data(buf);

Is there any kind of operation at all that can be put in "super_duper_memory_fence", that will guarantee that by the time use_data gets the pointer, the data in the buffer is also visible? It is my understanding that there is not a portable way to do this, and that Thread 2 must have a matching fence or other atomic operation in order to guarantee that it receives the writes made into the buffer and not just the pointer value. Is this correct?

First, you would have to ask yourself why you are doing this. I rarely see good motivation for using relaxed stores and loads in these situations. Use a release store and an acquire load and your problem disappears. Then, if you really measure a performance problem, revisit this. Second, there are fences in the standard, but they are not easy to use, because even for them you have to establish a happens-before relationship before you can argue about any side effects that you are supposed to see (or not). – Jens Gustedt
> First, you would have to ask yourself... <- I don't disagree with what you said, but in this case I'm not thinking of a specific use case. No, I'm not working on a supercomputer with thousands of weak-memory-order CPUs that would actually benefit from this. I simply realized that I didn't fully understand how the spec expects memory ordering to work, and that bothered me, because I like knowing as many details as possible about the tools that I use. – Anon

1 Answer


Thread 3's acquire syncs with Thread 2's release, which comes after Thread 2's acquire which syncs with Thread 1's release. Therefore, Thread 3 is guaranteed to see the value that Thread 1 set to x, correct?

Yes, this is correct. The acquire/release operations establish synchronize-with relations - i.e., store_release(a) synchronizes-with load_acquire(a) and store_release(b) synchronizes-with load_acquire(b). And load_acquire(a) is sequenced-before store_release(b). synchronize-with and sequenced-before are both part of the happens-before definition, and the happens-before relation is transitive. Therefore, store_relaxed(x, 1) happens-before load_relaxed(x).

Am I right in believing that according to the standard, there must always be a pair of non-relaxed atomic operations, one in each thread, in order for any kind of memory ordering at all to be guaranteed?

This question is a bit too broad, but overall I would tend to say "yes". In general you have to ensure that there is a proper happens-before relation when operating on some (non-atomic) shared data. If one thread writes to some shared data and some other thread should read that data, you have to ensure that the write happens-before the read. There are different ways to achieve this - atomics with the correct memory orderings are just one way (although one could argue that almost all other methods, like std::mutex, also boil down to atomic operations).

Fences also have to be combined with other fences or atomic operations. Your example would work if super_duper_memory_fence() were a std::atomic_thread_fence(std::memory_order_release) and you put another std::atomic_thread_fence(std::memory_order_acquire) before your call to use_data.

For more details I can recommend this paper which I have co-authored: Memory Models for C/C++ Programmers