27
votes

I tried looking for details on this, I even read the standard on mutexes and atomics... but still I couldnt understand the C++11 memory model visibility guarantees. From what I understand the very important feature of mutex BESIDE mutual exclusion is ensuring visibility. Aka it is not enough that only one thread per time is increasing the counter, it is important that the thread increases the counter that was stored by the thread that was last using the mutex(I really dont know why people dont mention this more when discussing mutexes, maybe I had bad teachers :)). So from what I can tell atomic doesnt enforce immediate visibility: (from the person that maintains boost::thread and has implemented c++11 thread and mutex library):

A fence with memory_order_seq_cst does not enforce immediate visibility to other threads (and neither does an MFENCE instruction). The C++0x memory ordering constraints are just that --- ordering constraints. memory_order_seq_cst operations form a total order, but there are no restrictions on what that order is, except that it must be agreed on by all threads, and it must not violate other ordering constraints. In particular, threads may continue to see "stale" values for some time, provided they see values in an order consistent with the constraints.

And I'm OK with that. But the problem is that I have trouble understanding what C++11 constructs regarding atomic are "global" and which only ensure consistency on atomic variables. In particular I have understanding which(if any) of the following memory orderings guarantee that there will be a memory fence before and after load and stores: http://www.stdthread.co.uk/doc/headers/atomic/memory_order.html

From what I can tell std::memory_order_seq_cst inserts mem barrier while other only enforce ordering of the operations on certain memory location.

So can somebody clear this up, I presume a lot of people are gonna be making horrible bugs using std::atomic , esp if they dont use default (std::memory_order_seq_cst memory ordering)
2. if I'm right does that mean that second line is redundand in this code:

atomicVar.store(42);
std::atomic_thread_fence(std::memory_order_seq_cst);  

3. do std::atomic_thread_fences have same requirements as mutexes in a sense that to ensure seq consistency on nonatomic vars one must do std::atomic_thread_fence(std::memory_order_seq_cst); before load and std::atomic_thread_fence(std::memory_order_seq_cst);
after stores?
4. Is

  {
    regularSum+=atomicVar.load();
    regularVar1++;
    regularVar2++;
    }
    //...
    {
    regularVar1++;
    regularVar2++;
    atomicVar.store(74656);
  }

equivalent to

std::mutex mtx;
{
   std::unique_lock<std::mutex> ul(mtx);
   sum+=nowRegularVar;
   regularVar++;
   regularVar2++;
}
//..
{
   std::unique_lock<std::mutex> ul(mtx);
    regularVar1++;
    regularVar2++;
    nowRegularVar=(74656);
}

I think not, but I would like to be sure.

EDIT: 5. Can assert fire?
Only two threads exist.

atomic<int*> p=nullptr; 

first thread writes

{
    nonatomic_p=(int*) malloc(16*1024*sizeof(int));
    for(int i=0;i<16*1024;++i)
    nonatomic_p[i]=42;
    p=nonatomic;
}

second thread reads

{
    while (p==nullptr)
    {
    }
    assert(p[1234]==42);//1234-random idx in array
}
2

2 Answers

27
votes

If you like to deal with fences, then a.load(memory_order_acquire) is equivalent to a.load(memory_order_relaxed) followed by atomic_thread_fence(memory_order_acquire). Similarly, a.store(x,memory_order_release) is equivalent to a call to atomic_thread_fence(memory_order_release) before a call to a.store(x,memory_order_relaxed). memory_order_consume is a special case of memory_order_acquire, for dependent data only. memory_order_seq_cst is special, and forms a total order across all memory_order_seq_cst operations. Mixed with the others it is the same as an acquire for a load, and a release for a store. memory_order_acq_rel is for read-modify-write operations, and is equivalent to an acquire on the read part and a release on the write part of the RMW.

The use of ordering constraints on atomic operations may or may not result in actual fence instructions, depending on the hardware architecture. In some cases the compiler will generate better code if you put the ordering constraint on the atomic operation rather than using a separate fence.

On x86, loads are always acquire, and stores are always release. memory_order_seq_cst requires stronger ordering with either an MFENCE instruction or a LOCK prefixed instruction (there is an implementation choice here as to whether to make the store have the stronger ordering or the load). Consequently, standalone acquire and release fences are no-ops, but atomic_thread_fence(memory_order_seq_cst) is not (again requiring an MFENCE or LOCKed instruction).

An important effect of the ordering constraints is that they order other operations.

std::atomic<bool> ready(false);
int i=0;

void thread_1()
{
    i=42;
    ready.store(true,memory_order_release);
}

void thread_2()
{
    while(!ready.load(memory_order_acquire)) std::this_thread::yield();
    assert(i==42);
}

thread_2 spins until it reads true from ready. Since the store to ready in thread_1 is a release, and the load is an acquire then the store synchronizes-with the load, and the store to i happens-before the load from i in the assert, and the assert will not fire.

2) The second line in

atomicVar.store(42);
std::atomic_thread_fence(std::memory_order_seq_cst);  

is indeed potentially redundant, because the store to atomicVar uses memory_order_seq_cst by default. However, if there are other non-memory_order_seq_cst atomic operations on this thread then the fence may have consequences. For example, it would act as a release fence for a subsequent a.store(x,memory_order_relaxed).

3) Fences and atomic operations do not work like mutexes. You can use them to build mutexes, but they do not work like them. You do not have to ever use atomic_thread_fence(memory_order_seq_cst). There is no requirement that any atomic operations are memory_order_seq_cst, and ordering on non-atomic variables can be achieved without, as in the example above.

4) No these are not equivalent. Your snippet without the mutex lock is thus a data race and undefined behaviour.

5) No your assert cannot fire. With the default memory ordering of memory_order_seq_cst, the store and load from the atomic pointer p work like the store and load in my example above, and the stores to the array elements are guaranteed to happen-before the reads.

7
votes

From what I can tell std::memory_order_seq_cst inserts mem barrier while other only enforce ordering of the operations on certain memory location.

It really depends on what you're doing, and on what platform you're working with. The strong memory ordering model on a platform like x86 will create a different set of requirements for the existence of memory fence operations compared to a weaker ordering model on platforms like IA64, PowerPC, ARM, etc. What the default parameter of std::memory_order_seq_cst is ensuring is that depending on the platform, the proper memory fence instructions will be used. On a platform like x86, there is no need for a full memory barrier unless you are doing a read-modify-write operation. Per the x86 memory model, all loads have load-acquire semantics, and all stores have store-release semantics. Thus, in these cases the std::memory_order_seq_cst enum basically creates a no-op since the memory model for x86 already ensures that those types of operations are consistent across threads, and therefore there are no assembly instructions that implement these types of partial memory barriers. Thus the same no-op condition would be true if you explicitly set a std::memory_order_release or std::memory_order_acquire setting on x86. Furthermore, requiring a full memory-barrier in these situations would be an unnecessary performance impediment. As noted, it would only be required for read-modify-store operations.

On other platforms with weaker memory consistency models though, that would not be the case, and therefore using std::memory_order_seq_cst would employ the proper memory fence operations without the user having to explicitly specify whether they would like a load-acquire, store-release, or full memory fence operation. These platforms have specific machine instructions for enforcing such memory consistency contracts, and the std::memory_order_seq_cst setting would work out the proper case. If the user would like to specifically call for one of these operations they can through the explicit std::memory_order enum types, but it would not be necessary ... the compiler would work out the correct settings.

I presume a lot of people are gonna be making horrible bugs using std::atomic , esp if they dont use default (std::memory_order_seq_cst memory ordering)

Yes, if they don't know what they're doing, and don't understand which types of memory barrier semantics that are called for in certain operations, then there will be a lot of mistakes made if they attempt to explicitly state the type of memory barrier and it's the incorrect one, especially on platforms that will not help their mis-understanding of memory ordering because they are weaker in nature.

Finally, keep in mind with your situation #4 concerning a mutex that there are two different things that need to happen here:

  1. The compiler must not be allowed to reorder operations across the mutex and critical section (especially in the case of an optimizing compiler)
  2. There must be the requisite memory fences created (depending on the platform) that maintain a state where all stores are completed before the critical section and reading of the mutex variable, and all stores are completed before exiting the critical section.

Since by default, atomic stores and loads are implemented with std::memory_order_seq_cst, then using atomics would also implement the proper mechanisms to satisfy conditions #1 and #2. That being said, in your first example with atomics, the load would enforce acquire-semantics for the block, while the store would enforce release semantics. It would not though enforce any particular ordering inside the "critical section" between these two operations though. In your second example, you have two different sections with locks, each lock having acquire semantics. Since at some point you would have to release the locks, which would have release semantics, then no, the two code blocks would not be equivalent. In the first example, you've created a big "critical section" between the load and store (assuming this is all happening on the same thread). In the second example you have two different critical sections.

P.S. I've found the following PDF particularly instructive, and you may find it too: http://www.nwcpp.org/Downloads/2008/Memory_Fences.pdf