I'm not an ARM expert, but won't those stores and loads be subject to reordering, at least on some ARM architectures?

  #include <atomic>
  using namespace std;

  atomic<int> atomic_var;
  int nonAtomic_var;
  int nonAtomic_var2;

  void foo()
  {       
          atomic_var.store(111, memory_order_relaxed);
          atomic_var.store(222, memory_order_relaxed);
  }

  void bar()
  {       
          nonAtomic_var = atomic_var.load(memory_order_relaxed);
          nonAtomic_var2 = atomic_var.load(memory_order_relaxed);
  }

I've had no success in making the compiler put memory barriers between them.

I've tried something like below (on x64):

$ arm-linux-gnueabi-g++ -mcpu=cortex-a9 -std=c++11 -S -O1 test.cpp

And I've got:

  _Z3foov:
          .fnstart
  .LFB331:
          @ args = 0, pretend = 0, frame = 0
          @ frame_needed = 0, uses_anonymous_args = 0
          @ link register save eliminated.
          movw    r3, #:lower16:.LANCHOR0
          movt    r3, #:upper16:.LANCHOR0
          mov     r2, #111
          str     r2, [r3]
          mov     r2, #222
          str     r2, [r3]
          bx      lr
          ;...
  _Z3barv:
          .fnstart
  .LFB332:
          @ args = 0, pretend = 0, frame = 0
          @ frame_needed = 0, uses_anonymous_args = 0
          @ link register save eliminated.
          movw    r3, #:lower16:.LANCHOR0
          movt    r3, #:upper16:.LANCHOR0
          ldr     r2, [r3]
          str     r2, [r3, #4]
          ldr     r2, [r3]
          str     r2, [r3, #8]
          bx      lr

Are loads and stores to the same location never reordered on ARM? I couldn't find such restriction in the ARM docs.

I'm asking in regard to the c++11 standard which states that:

All modifications to any particular atomic variable occur in a total order that is specific to this one atomic variable.

1 Answer

The total order for a single variable exists because of cache coherency (MESI): a store can't commit from the store buffer into L1d cache and become globally visible to other threads unless the core owns exclusive access to that cache line. (MESI Exclusive or Modified state.)

That C++ guarantee doesn't require any barriers to implement on any normal CPU architecture, because all normal ISAs have coherent caches, normally using a variant of MESI. This is why volatile happens to work as a legacy / UB version of mo_relaxed atomic on mainstream C++ implementations (but generally don't do it). See also "When to use volatile with multi threading?" for more details.

(Some systems exist with two different kinds of CPU that share memory, e.g. microcontroller + DSP, but C++ std::thread won't start threads across cores that don't share a coherent view of that memory. So compilers only have to do code-gen for ARM cores in the same inner-shared coherency domain.)


For any given atomic object, a total order of modification by all threads will always exist (as guaranteed by the ISO C++ standard you quoted), but you don't know ahead of time what it's going to be unless you establish synchronization between threads.

For example, different runs of this program could have both loads go first, or one load, then both stores, then the other load.

This total order (for a single variable) will be compatible with program order for each thread, but is an arbitrary interleaving of program orders.
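To make the "arbitrary interleaving" concrete, here is a minimal sketch built around the question's foo() and bar(). The names r1, r2, mod_order_index() and run_once() are made up for illustration. Since the modification order of atomic_var is 0, then 111, then 222, the only outcomes coherence allows for (r1, r2) are (0,0), (0,111), (0,222), (111,111), (111,222) and (222,222); which one you get on a given run depends on timing.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> atomic_var{0};
int r1 = 0, r2 = 0;   // hypothetical stand-ins for nonAtomic_var / nonAtomic_var2

void foo() {
    atomic_var.store(111, std::memory_order_relaxed);
    atomic_var.store(222, std::memory_order_relaxed);
}

void bar() {
    r1 = atomic_var.load(std::memory_order_relaxed);
    r2 = atomic_var.load(std::memory_order_relaxed);
}

// Position of a value in the modification order 0 -> 111 -> 222,
// or -1 for a value that was never stored.
int mod_order_index(int v) {
    if (v == 0)   return 0;
    if (v == 111) return 1;
    if (v == 222) return 2;
    return -1;
}

// Runs foo() and bar() concurrently once. Returns true if the observed
// pair (r1, r2) is consistent with the modification order, i.e. the
// second load did not see an older value than the first load.
bool run_once() {
    atomic_var.store(0, std::memory_order_relaxed);
    std::thread writer(foo);
    std::thread reader(bar);
    writer.join();
    reader.join();   // after join(), r1 and r2 are safe to read here

    // The last store in the writer's program order is last in the modification order.
    assert(atomic_var.load(std::memory_order_relaxed) == 222);

    int i1 = mod_order_index(r1);
    int i2 = mod_order_index(r2);
    return i1 >= 0 && i2 >= 0 && i1 <= i2;
}
```

Calling run_once() in a loop should always return true, but the particular (r1, r2) pair can differ from run to run.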

memory_order_relaxed only guarantees atomicity of operations on that variable, not ordering wrt. anything else. The only ordering that's fixed at compile time is wrt. other accesses to the same atomic variable by this thread.
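As a rough illustration of "atomicity without ordering", the sketch below has several threads increment one counter with relaxed read-modify-writes. count_relaxed() and its parameters are hypothetical names, not anything from the question. No increment is lost because each fetch_add is atomic, even though relaxed says nothing about ordering with respect to other variables.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: n_threads threads each increment a shared counter
// n_iters times using relaxed atomic read-modify-write operations.
long count_relaxed(int n_threads, int n_iters) {
    assert(n_threads > 0 && n_iters >= 0);

    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&counter, n_iters] {
            for (int i = 0; i < n_iters; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers)
        w.join();   // join() makes all increments visible to this thread

    return counter.load(std::memory_order_relaxed);
}
```

With a plain non-atomic long this would be a data race (UB), and in practice increments would typically get lost.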

Different threads will agree on the modification order for this variable, but might disagree on the global modification order for all objects. (ARMv8 made the ARM memory model multi-copy-atomic, so this is impossible there, and probably no real earlier ARM violated that. POWER, however, does in real life allow two independent reader threads to disagree on the order of stores by 2 other independent writer threads. This is called IRIW reordering; see "Will two atomic writes to different locations in different threads always be seen in the same order by other threads?")

The fact that IRIW reordering is a possibility when multiple variables are involved is (among other things) why it even needs to be said that a total modification order does always exist for each individual variable separately.

For an all-thread total order to exist, all of your atomic accesses need to use seq_cst, which would involve barriers. Of course, that still wouldn't fully determine at compile time what that order will be; different timings on different runs will lead to acquire loads seeing a certain store or not.
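Here is a sketch of the IRIW litmus test with everything seq_cst; the names x, y, IriwResult, and count_disagreements() are invented for this example. Two writers store to different variables and two readers load them in opposite orders. With seq_cst, the outcome where the readers disagree (one sees x before y, the other sees y before x) is forbidden. If every access were relaxed or acquire/release instead, ISO C++ would allow that outcome, and as noted above POWER can actually produce it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct IriwResult { int r1, r2, r3, r4; };

// One run of the IRIW test with all accesses seq_cst.
IriwResult run_iriw_seq_cst() {
    std::atomic<int> x{0}, y{0};
    IriwResult r{0, 0, 0, 0};

    std::thread w1([&] { x.store(1, std::memory_order_seq_cst); });
    std::thread w2([&] { y.store(1, std::memory_order_seq_cst); });
    std::thread rd1([&] {            // reads x, then y
        r.r1 = x.load(std::memory_order_seq_cst);
        r.r2 = y.load(std::memory_order_seq_cst);
    });
    std::thread rd2([&] {            // reads y, then x
        r.r3 = y.load(std::memory_order_seq_cst);
        r.r4 = x.load(std::memory_order_seq_cst);
    });

    w1.join(); w2.join(); rd1.join(); rd2.join();
    return r;
}

// True if reader 1 saw x's store but not y's, while reader 2 saw
// y's store but not x's, i.e. the readers disagree on the store order.
bool readers_disagree(const IriwResult& r) {
    return r.r1 == 1 && r.r2 == 0 && r.r3 == 1 && r.r4 == 0;
}

// Repeats the test and counts disagreeing outcomes.
int count_disagreements(int trials) {
    assert(trials > 0);
    int n = 0;
    for (int i = 0; i < trials; ++i)
        if (readers_disagree(run_iriw_seq_cst()))
            ++n;
    return n;
}
```

With seq_cst the count should always be 0. The individual values of r1..r4 still vary from run to run, which is the point: the total order exists, but it isn't known at compile time.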

Are loads and stores to the same location never reordered on ARM?

From within a single thread, no. If you do multiple stores to a memory location, the last one in program order will always appear as the last to other threads, i.e. once the dust settles, the memory location will have the value stored by the last store. Anything else would break the illusion of program order for threads reloading their own stores.


Some of the ordering guarantees in the C++ standard are even called "write-write coherence" and other kinds of coherence. ISO C++ doesn't explicitly require coherent caches (an implementation on an ISA that needs explicit flushing is possible), but such an implementation would not be efficient.

http://eel.is/c++draft/intro.races#19

[ Note: The four preceding coherence requirements effectively disallow compiler reordering of atomic operations to a single object, even if both operations are relaxed loads. This effectively makes the cache coherence guarantee provided by most hardware available to C++ atomic operations. — end note ]


Most of the above is about modification order, not LoadLoad reordering.

That is a separate thing. C++ guarantees read-read coherence, i.e. that 2 reads of the same atomic object by the same thread happen in program order relative to each other.

http://eel.is/c++draft/intro.races#16

If a value computation A of an atomic object M happens before a value computation B of M, and A takes its value from a side effect X on M, then the value computed by B shall either be the value stored by X or the value stored by a side effect Y on M, where Y follows X in the modification order of M. [ Note: This requirement is known as read-read coherence. — end note ]

A "value computation" is a read, aka load, of a variable. The requirement that B see either the value stored by X or the value of some Y that follows X in the modification order is the part that guarantees that later reads in the same thread can't observe earlier writes from other threads (earlier than a write they already saw).

That's one of the 4 conditions that the previous quote I linked was talking about.

The fact that compilers compile it to two plain ARM loads is proof enough that the ARM ISA also guarantees this. (Because we know for sure that ISO C++ requires it.)

I'm not familiar with ARM manuals but presumably it's in there somewhere.

See also A Tutorial Introduction to the ARM and POWER Relaxed Memory Models - a paper that goes into significant detail about what reorderings are/aren't allowed for various test cases.