13
votes

I was under the impression that memory loads could not be hoisted above an acquiring load in the C++11 memory model. However, looking at the code that gcc 4.8 produces, that only seems to be true for other atomic loads, not for all of memory. If that's true, and acquiring loads don't synchronize all memory (just std::atomics), then I'm not sure how it would be possible to implement general-purpose mutexes in terms of std::atomic.

The following code:

extern std::atomic<unsigned> seq;
extern std::atomic<int> data;

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1);
    return data_copy;
}

Produces:

_Z6readerv:
.L3:
    mov ecx, DWORD PTR seq[rip]
    mov eax, DWORD PTR data[rip]
    mov edx, DWORD PTR seq[rip]
    cmp ecx, edx
    jne .L3
    rep ret

Which looks correct to me.

However changing data to be an int rather than std::atomic:

extern std::atomic<unsigned> seq;
extern int data;

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data;
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1);
    return data_copy;
}

Produces this:

_Z6readerv:
    mov eax, DWORD PTR data[rip]
.L3:
    mov ecx, DWORD PTR seq[rip]
    mov edx, DWORD PTR seq[rip]
    cmp ecx, edx
    jne .L3
    rep ret

So what's going on?

3
If you rewrite the atomic ops order to load(rel); fence(acq); in the second version, does its output asm change? – yohjp
@yohjp Are you referring to the load of seq0? If so then no, it doesn't affect the generated code at all. – jleahy
No, I meant seq1. In the C++11 memory model, an "acquire fence" with acquire semantics consists of the ops order seq1.load(relaxed) -> fence(acquire), not fence(acquire) -> seq1.load(relaxed). C++'s "fence" only influences the happens-before relationship between atomic operations and/or fences; it has no direct impact on non-atomic variables. In this respect, C++'s "fence" is quite different from a processor's/compiler's memory barrier instruction (like mfence on x86). – yohjp
@yohjp Have a look at the edit I just made; it reduces the complexity a lot. Do you have a standards quote for "only influences the happens-before relationship between atomic operations and/or fences"? If that's true, then it's not possible to synchronize non-atomic data with these fences. – jleahy
That sentence is not a direct quote from the C++11 standard, but the result of my interpretation of 29.8 [atomic.fences] and 1.10 [intro.multithread]. IMO, "it's not possible to synchronize non-atomic data with these fences" alone (as you say); synchronization for non-atomic data is attained with a combination of 'non-atomic + atomic ops' or 'non-atomic + atomic ops + atomic_fence'. – yohjp

3 Answers

4
votes

Why a load was hoisted above an acquire

I've posted this on the gcc bugzilla and they've confirmed it as a bug.

the MEM alias-set of -1 (ALIAS_SET_MEMORY_BARRIER) is supposed to prevent this, but PRE does not know about this special property (it should "kill" all refs crossing it).

It looks like the gcc wiki has a nice page about this.

Generally, release is a barrier to sinking code, acquire is a barrier to hoisting code.

Why this code is still broken

As per this paper, my code is still incorrect because it introduces a data race. Even though the patched gcc generates the correct code, it's still not valid to access data without wrapping it in std::atomic. The reason is that data races are undefined behavior, even if the values resulting from them are discarded.

An example courtesy of AdamH.Peterson:

int foo(unsigned x) {
    if (x < 10) {
        /* some calculations that spill all the 
           registers so x has to be reloaded below */
        switch (x) {
        case 0:
            return 5;
        case 1:
            return 10;
        // ...
        case 9:
            return 43;
        }
    }
    return 0;
}

Here a compiler might optimise the switch into a jump table and, thanks to the if statement above, would be able to avoid a range check on the reloaded x. However, if data races were not undefined behaviour, a second range check would be required, because a racing write could make the reloaded x differ from the value that passed the if, sending the jump table out of bounds.

1
votes

I do not think your atomic_thread_fence is correct. The only C++11 memory order that would work with your code is seq_cst, but that is very expensive (you're going to get a full memory fence) for what you need.

The original code works and I think this is the best performance tradeoff.

EDIT based on your updates:

If you're looking for the formal reason why the code with a regular int does not work the way you'd like, I believe the very paper you quoted (http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf) gives the answer. Look at the end of section 2. Your code has the same problem as the code in Figure 1: it has data races. Multiple threads can operate on the regular int's memory at the same time. That is forbidden by the C++11 memory model, so this code is formally not valid C++.

gcc expects the code to have no data races, i.e. to be valid C++ code. Since there is no race and the code loads the int unconditionally, the load can be emitted anywhere in the body of the function. So gcc is smart: it just emits the load once, since the variable is not volatile. The conditional statement that usually goes hand in hand with an acquire barrier plays an important role in what the compiler will do.

In the formal parlance of the standard, the atomic loads and the regular int load are unsequenced. Introducing a condition, for example, would create a sequence point and force the compiler to evaluate the regular int after it (http://msdn.microsoft.com/en-us/library/d45c7a5d.aspx). Then the C++ memory model would do the rest (i.e. ensure visibility to the CPU executing the instructions).

So neither of your statements is true. You can definitely build a lock with C++11, just not one with data races :-) Typically a lock involves waiting before reading (which is obviously what you're trying to avoid here), so you do not have this kind of problem.

Note that your original seqlock is buggy: you would not want to just check seq0 != seq1 (you could be in the middle of an update). The seqlock paper has the correct condition.

0
votes

I'm still new at reasoning about these non-sequentially-consistent memory order operations and barriers, but it could be that this code generation is correct (or rather permissible). On the face of it, it certainly looks fishy, but I wouldn't be surprised if there's no way for a standard-conforming program to tell that the load from data was hoisted (which would mean this code is correct under the "as if" rule).

The program reads two subsequent values from an atomic, one before and one after the load from data, and retries whenever they don't match. In principle, there's no reason the two atomic reads ever have to see different values from each other. Even if an atomic write has just occurred, there's no way for this thread to detect that it didn't simply read the old value again. The thread would then go back into the loop and eventually read two consistent values from the atomic, then return; but since seq0 and seq1 are then discarded, the program can't tell that the value in seq0 doesn't correspond to the value read from data. Now, in principle, this also suggests to me that the entire loop could have been elided and only the load from data is actually necessary for correctness, but failure to elide the loop isn't necessarily a correctness issue.

If reader() were to return a pair<int, unsigned> that included seq0 (or seq1), and the same hoisted load were generated, I think it would probably be incorrect code (but again, I'm new to reasoning about these non-sequentially-consistent operations).