13
votes

I was under the impression that memory loads could not be hoisted above an acquiring load in the C++11 memory model. However, looking at the code that gcc 4.8 produces, that only seems to be true for other atomic loads, not for all of memory. If that's true, and acquiring loads don't synchronize all memory (just std::atomics), then I'm not sure how it would be possible to implement general-purpose mutexes in terms of std::atomic.

The following code:

extern std::atomic<unsigned> seq;
extern std::atomic<int> data;

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1);
    return data_copy;
}

Produces:

_Z6readerv:
.L3:
    mov ecx, DWORD PTR seq[rip]
    mov eax, DWORD PTR data[rip]
    mov edx, DWORD PTR seq[rip]
    cmp ecx, edx
    jne .L3
    rep ret

Which looks correct to me.

However changing data to be an int rather than std::atomic:

extern std::atomic<unsigned> seq;
extern int data;

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data;
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1);
    return data_copy;
}

Produces this:

_Z6readerv:
    mov eax, DWORD PTR data[rip]
.L3:
    mov ecx, DWORD PTR seq[rip]
    mov edx, DWORD PTR seq[rip]
    cmp ecx, edx
    jne .L3
    rep ret

So what's going on?

3
If you rewrite the atomic ops order to load(rel); fence(acq); in the second version, does its output asm change? – yohjp
@yohjp Are you referring to the load of seq0? If so then no, it doesn't affect the generated code at all. – jleahy
No, I meant seq1. In the C++11 memory model, an "acquire fence" with acquire semantics consists of the ops order seq1.load(relaxed) -> fence(acquire), not fence(acquire) -> seq1.load(relaxed). C++'s "fence" only influences the happens-before relationship between atomic operations and/or fences; it has no direct impact on non-atomic variables. In this respect, C++'s "fence" is quite different from a processor's/compiler's memory barrier instruction (like mfence on x86). – yohjp
@yohjp Have a look at the edit I just made; it reduces the complexity a lot. Do you have a standards quote for "only influences the happens-before relationship between atomic operations and/or fences"? If that's true, then it's not possible to synchronize non-atomic data with these fences. – jleahy
That sentence is not a direct quote from the C++11 standard, but the result of my interpretation of 29.8 [atomic.fences] and 1.10 [intro.multithread]. IMO, "it's not possible to synchronize non-atomic data with these fences" alone (as you say); synchronization for non-atomic data is attained with a combination of 'non-atomic + atomic ops' or 'non-atomic + atomic ops + atomic_fence'. – yohjp

3 Answers

4
votes

Why a load was hoisted above an acquire

I've posted this on the gcc bugzilla and they've confirmed it as a bug.

the MEM alias-set of -1 (ALIAS_SET_MEMORY_BARRIER) is supposed to prevent this, but PRE does not know about this special property (it should "kill" all refs crossing it).

It looks like the gcc wiki has a nice page about this.

Generally, release is a barrier to sinking code, acquire is a barrier to hoisting code.

Why this code is still broken

As per this paper, my code is still incorrect because it introduces a data race. Even though the patched gcc generates the correct code, it's still not valid to access data without wrapping it in std::atomic. The reason is that data races are undefined behavior, even if the values resulting from them are discarded.

An example courtesy of AdamH.Peterson:

int foo(unsigned x) {
    if (x < 10) {
        /* some calculations that spill all the 
           registers so x has to be reloaded below */
        switch (x) {
        case 0:
            return 5;
        case 1:
            return 10;
        // ...
        case 9:
            return 43;
        }
    }
    return 0;
}

Here a compiler might optimise the switch into a jump table and, thanks to the if statement above, would be able to avoid a range check on the reloaded x. However, if data races were not undefined behaviour, a second range check would be required, because a racing write could make the reloaded x differ from the value that passed the if, sending the jump table out of bounds.

1
votes

I do not think your atomic_thread_fence is correct. The only C++11 memory order that would work with your code is seq_cst, but that is very expensive (you're going to get a full memory fence) for what you need.

The original code works and I think this is the best performance tradeoff.

EDIT based on your updates:

If you're looking for the formal reason why the code with a regular int does not work the way you'd like, I believe the very paper you quoted (http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf) gives the answer. Look at the end of section 2. Your code has the same problem as the code in Figure 1: it has data races. Multiple threads can operate on the regular int's memory at the same time. That is forbidden by the C++11 memory model, so this code is formally not valid C++.

gcc expects the code to have no data races, i.e. to be valid C++ code. Since there is no race and the code loads the int unconditionally, the load can be emitted anywhere in the body of the function. So gcc is smart: it just emits the load once, since the variable is not volatile. The conditional statement that usually goes hand in hand with an acquire barrier plays an important role in what the compiler will do.

In the formal parlance of the standard, the atomic loads and the regular int load are unsequenced. Introducing a condition, for example, would create a sequence point and force the compiler to evaluate the regular int after it (http://msdn.microsoft.com/en-us/library/d45c7a5d.aspx). Then the C++ memory model would do the rest (i.e. ensure visibility to the CPU executing the instructions).

So neither of your statements is true. You can definitely build a lock with C++11, just not one with data races :-) Typically a lock involves waiting before reading (which is obviously what you're trying to avoid here), so you do not have this kind of problem.

Note that your original seqlock is buggy: you would not want to just check seq0 != seq1 (you could be in the middle of an update). The seqlock paper has the correct condition.

0
votes

I'm still new at reasoning about these non-sequentially-consistent memory order operations and barriers, but it could be that this code generation is correct (or rather permissible). On the face of it, it certainly looks fishy, but I wouldn't be surprised if there's no way for a standard-conforming program to tell that the load from data was hoisted (which would mean this code is correct under the "as if" rule).

The program reads two subsequent values from an atomic, one before and one after the load from data, and retries whenever they don't match. In principle, there's no reason the two atomic reads ever have to see different values from each other. Even if an atomic write has just occurred, there's no way for this thread to detect that it didn't simply read the old value again. The thread would then go back into the loop and eventually read two consistent values from the atomic, then return; but since seq0 and seq1 are then discarded, the program can't tell that the value in seq0 doesn't correspond to the value read from data. Now, in principle, this also suggests to me that the entire loop could have been elided and only the load from data is actually necessary for correctness, but failure to elide the loop isn't necessarily a correctness issue.

If reader() were to return a pair<int, unsigned> that included seq0 (or seq1), and the same hoisted load were generated, I think it would probably be incorrect code (but again, I'm new to reasoning about these non-sequentially-consistent operations).