16
votes

Does standard C++11 guarantee that memory_order_seq_cst prevents StoreLoad reordering around an atomic operation for non-atomic memory accesses?

As known, there are 6 std::memory_orders in C++11, and its specifies how regular, non-atomic memory accesses are to be ordered around an atomic operation - Working Draft, Standard for Programming Language C++ 2016-07-12: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf

§ 29.3 Order and consistency

§ 29.3 / 1

The enumeration memory_order specifies the detailed regular (non-atomic) memory synchronization order as defined in 1.10 and may provide for operation ordering. Its enumerated values and their meanings are as follows:

Also known, that these 6 memory_orders prevent some of these reordering:

enter image description here

But, does memory_order_seq_cst prevent StoreLoad reordering around an atomic operation for regular, non-atomic memory accesses or only for other atomic with the same memory_order_seq_cst?

I.e. to prevent this StoreLoad-reordering should we use std::memory_order_seq_cst for both STORE and LOAD, or only for one of it?

std::atomic<int> a, b;
b.store(1, std::memory_order_seq_cst); // Sequential Consistency
a.load(std::memory_order_seq_cst); // Sequential Consistency

About Acquire-Release semantic is all clear, it specifies exactly non-atomic memory-access reordering across atomic operations: http://en.cppreference.com/w/cpp/atomic/memory_order


To prevent StoreLoad-reordering we should use std::memory_order_seq_cst.

Two examples:

  1. std::memory_order_seq_cst for both STORE and LOAD: there is MFENCE

StoreLoad can't be reordered - GCC 6.1.0 x86_64: https://godbolt.org/g/mVZJs0

std::atomic<int> a, b;
b.store(1, std::memory_order_seq_cst); // can't be executed after LOAD
a.load(std::memory_order_seq_cst); // can't be executed before STORE
  1. std::memory_order_seq_cst for LOAD only: there isn't MFENCE

StoreLoad can be reordered - GCC 6.1.0 x86_64: https://godbolt.org/g/2NLy12

std::atomic<int> a, b;
b.store(1, std::memory_order_release); // can be executed after LOAD
a.load(std::memory_order_seq_cst); // can be executed before STORE

Also if C/C++-compiler used alternative mapping of C/C++11 to x86, which flushes the Store Buffer before the LOAD: MFENCE,MOV (from memory), so we must use std::memory_order_seq_cst for LOAD too: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html As this example is discussed in another question as approach (3): Does it make any sense instruction LFENCE in processors x86/x86_64?

I.e. we should use std::memory_order_seq_cst for both STORE and LOAD to generate MFENCE guaranteed, that prevents StoreLoad reordering.

Is it true, that memory_order_seq_cst for atomic Load or Store:

  • specifi Acquire-Release semantic - prevent: LoadLoad, LoadStore, StoreStore reordering around an atomic operation for regular, non-atomic memory accesses,

  • but prevent StoreLoad reordering around an atomic operation only for other atomic operations with the same memory_order_seq_cst?

2

2 Answers

8
votes

No, standard C++11 doesn't guarantee that memory_order_seq_cst prevents StoreLoad reordering of non-atomic around an atomic(seq_cst).

Even standard C++11 doesn't guarantee that memory_order_seq_cst prevents StoreLoad reordering of atomic(non-seq_cst) around an atomic(seq_cst).

Working Draft, Standard for Programming Language C++ 2016-07-12: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/n4606.pdf

  • There shall be a single total order S on all memory_order_seq_cst operations - C++11 Standard:

§ 29.3

3

There shall be a single total order S on all memory_order_seq_cst operations, consistent with the “happens before” order and modification orders for all affected locations, such that each memory_order_seq_cst operation B that loads a value from an atomic object M observes one of the following values: ...

  • But, any atomic operations with ordering weaker than memory_order_seq_cst hasn't sequential consistency and hasn't single total order, i.e. non-memory_order_seq_cst operations can be reordered with memory_order_seq_cst operations in allowed directions - C++11 Standard:

§ 29.3

8 [ Note: memory_order_seq_cst ensures sequential consistency only for a program that is free of data races and uses exclusively memory_order_seq_cst operations. Any use of weaker ordering will invalidate this guarantee unless extreme care is used. In particular, memory_order_seq_cst fences ensure a total order only for the fences themselves. Fences cannot, in general, be used to restore sequential consistency for atomic operations with weaker ordering specifications. — end note ]


Also C++-compilers allows such reorderings:

  1. On x86_64

Usually - if in compilers seq_cst implemented as barrier after store, then:

STORE-C(relaxed); LOAD-B(seq_cst); can be reordered to LOAD-B(seq_cst); STORE-C(relaxed);

Screenshot of Asm generated by GCC 7.0 x86_64: https://godbolt.org/g/4yyeby

Also, theoretically possible - if in compilers seq_cst implemented as barrier before load, then:

STORE-A(seq_cst); LOAD-C(acq_rel); can be reordered to LOAD-C(acq_rel); STORE-A(seq_cst);

  1. On PowerPC

STORE-A(seq_cst); LOAD-C(relaxed); can be reordered to LOAD-C(relaxed); STORE-A(seq_cst);

Also on PowerPC can be such reordering:

STORE-A(seq_cst); STORE-C(relaxed); can reordered to STORE-C(relaxed); STORE-A(seq_cst);

If even atomic variables are allowed to be reordered across atomic(seq_cst), then non-atomic variables can also be reordered across atomic(seq_cst).

Screenshot of Asm generated by GCC 4.8 PowerPC: https://godbolt.org/g/BTQBr8


More details:

  1. On x86_64

STORE-C(release); LOAD-B(seq_cst); can be reordered to LOAD-B(seq_cst); STORE-C(release);

Intel® 64 and IA-32 Architectures

8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations

I.e. x86_64 code:

STORE-A(seq_cst);
STORE-C(release); 
LOAD-B(seq_cst);

Can be reordered to:

STORE-A(seq_cst);
LOAD-B(seq_cst);
STORE-C(release); 

This can happen because between c.store and b.load isn't mfence:

x86_64 - GCC 7.0: https://godbolt.org/g/dRGTaO

C++ & asm - code:

#include <atomic>

// Atomic load-store
void test() {
    std::atomic<int> a, b, c;
    a.store(2, std::memory_order_seq_cst);          // movl 2,[a]; mfence;
    c.store(4, std::memory_order_release);          // movl 4,[c];
    int tmp = b.load(std::memory_order_seq_cst);    // movl [b],[tmp];
}

It can be reordered to:

#include <atomic>

// Atomic load-store
void test() {
    std::atomic<int> a, b, c;
    a.store(2, std::memory_order_seq_cst);          // movl 2,[a]; mfence;
    int tmp = b.load(std::memory_order_seq_cst);    // movl [b],[tmp];
    c.store(4, std::memory_order_release);          // movl 4,[c];
}

Also, Sequential Consistency in x86/x86_64 can be implemented in four ways: http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html

  1. LOAD (without fence) and STORE + MFENCE
  2. LOAD (without fence) and LOCK XCHG
  3. MFENCE + LOAD and STORE (without fence)
  4. LOCK XADD ( 0 ) and STORE (without fence)
  • 1 and 2 ways: LOAD and (STORE+MFENCE)/(LOCK XCHG) - we reviewed above
  • 3 and 4 ways: (MFENCE+LOAD)/LOCK XADD and STORE - allow next reordering:

STORE-A(seq_cst); LOAD-C(acq_rel); can be reordered to LOAD-C(acq_rel); STORE-A(seq_cst);


  1. On PowerPC

STORE-A(seq_cst); LOAD-C(relaxed); can be reordered to LOAD-C(relaxed); STORE-A(seq_cst);

Allows Store-Load reordering (Table 5 - PowerPC): http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

Stores Reordered After Loads

I.e. PowerPC code:

STORE-A(seq_cst);
STORE-C(relaxed); 
LOAD-C(relaxed); 
LOAD-B(seq_cst);

Can be reordered to:

LOAD-C(relaxed);
STORE-A(seq_cst);
STORE-C(relaxed); 
LOAD-B(seq_cst);

PowerPC - GCC 4.8 : https://godbolt.org/g/xowFD3

C++ & asm - code:

#include <atomic>

// Atomic load-store
void test() {
    std::atomic<int> a, b, c;       // addr: 20, 24, 28
    a.store(2, std::memory_order_seq_cst);          // li r9<-2; sync; stw r9->[a];
    c.store(4, std::memory_order_relaxed);          // li r9<-4; stw r9->[c];
    c.load(std::memory_order_relaxed);              // lwz r9<-[c];
    int tmp = b.load(std::memory_order_seq_cst);    // sync; lwz r9<-[b]; ... isync;
}

By dividing a.store into two parts - it can be reordered to:

#include <atomic>

// Atomic load-store
void test() {
    std::atomic<int> a, b, c;       // addr: 20, 24, 28
    //a.store(2, std::memory_order_seq_cst);            // part-1: li r9<-2; sync;
    c.load(std::memory_order_relaxed);              // lwz r9<-[c];
    a.store(2, std::memory_order_seq_cst);          // part-2: stw r9->[a];
    c.store(4, std::memory_order_relaxed);          // li r9<-4; stw r9->[c];
    int tmp = b.load(std::memory_order_seq_cst);    // sync; lwz r9<-[b]; ... isync;
}

Where load-from-memory lwz r9<-[c]; executed earlier than store-to-memory stw r9->[a];.


Also on PowerPC can be such reordering:

STORE-A(seq_cst); STORE-C(relaxed); can reordered to STORE-C(relaxed); STORE-A(seq_cst);

Because PowerPC has weak memory ordering model - allows Store-Store reordering (Table 5 - PowerPC): http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2010.06.07c.pdf

Stores Reordered After Stores

I.e. on PowerPC operations Store can be reordered with other Store, then previous example can be reordered such as:

#include <atomic>

// Atomic load-store
void test() {
    std::atomic<int> a, b, c;       // addr: 20, 24, 28
    //a.store(2, std::memory_order_seq_cst);            // part-1: li r9<-2; sync;
    c.load(std::memory_order_relaxed);              // lwz r9<-[c];
    c.store(4, std::memory_order_relaxed);          // li r9<-4; stw r9->[c];
    a.store(2, std::memory_order_seq_cst);          // part-2: stw r9->[a];
    int tmp = b.load(std::memory_order_seq_cst);    // sync; lwz r9<-[b]; ... isync;
}

Where store-to-memory stw r9->[c]; executed earlier than store-to-memory stw r9->[a];.

-1
votes

The std::memory_order_seq_cst guarantees there is no reordering by either compiler nor cpu. In this case the same memory order as if only one instruction where executed at a time.

But the compiler optimization confuses the issues, if you turn off -O3 then the fence is there.

The compiler can see that in your test program with -O3 that there are no consequence of the mfence as the program is too simple.

If you ran it on an Arm on the other hand like this you can see the barriers dmb ish.

So if your program is more complex you might see the mfence in this part of the code but not if the compiler can analyse and reason that it is not needed.