2 votes

I was running a bunch of algorithms through Relacy to verify their correctness and I stumbled onto something I didn't really understand. Here's a simplified version of it:

#include <thread>
#include <atomic>
#include <iostream>
#include <cassert> 

struct RMW_Ordering
{
    std::atomic<bool> flag {false};
    std::atomic<unsigned> done {0}, counter {0};
    unsigned race_cancel {0}, race_success {0}, sum {0};

    void thread1() // fail
    {
        race_cancel = 1; // data produced

        if (counter.fetch_add(1, std::memory_order_release) == 1 &&
            !flag.exchange(true, std::memory_order_relaxed))
        {
            counter.store(0, std::memory_order_relaxed);
            done.store(1, std::memory_order_relaxed);
        }
    }

    void thread2() // success
    {
        race_success = 1; // data produced

        if (counter.fetch_add(1, std::memory_order_release) == 1 &&
            !flag.exchange(true, std::memory_order_relaxed))
        {
            done.store(2, std::memory_order_relaxed);
        }
    }

    void thread3()
    {
        while (!done.load(std::memory_order_relaxed)); // livelock test
        counter.exchange(0, std::memory_order_acquire);
        sum = race_cancel + race_success;
    }
};

int main()
{
    for (unsigned i = 0; i < 1000; ++i)
    {
        RMW_Ordering test;

        std::thread t1([&]() { test.thread1(); });    
        std::thread t2([&]() { test.thread2(); });
        std::thread t3([&]() { test.thread3(); });

        t1.join();
        t2.join();
        t3.join();

        assert(test.counter == 0);
    }

    std::cout << "Done!" << std::endl;
}

Two threads race to enter a protected region, and the last one to enter modifies done, releasing a third thread from an infinite loop. The example is a bit contrived, but the original code needs to claim this region through the flag to signal "done".

Initially, the fetch_add had acq_rel ordering because I was concerned the exchange might get reordered before it, which could let one thread claim the flag, perform the fetch_add check first, and prevent the other thread (the one that gets past the increment check) from successfully modifying the schedule. While testing with Relacy, I decided to check whether the livelock I expected would actually occur if I switched from acq_rel to release, and to my surprise, it didn't. I then used relaxed for everything, and again, no livelock.

I tried to find any rules regarding this in the C++ standard but only managed to dig up these:

1.10.7 In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics.

29.3.11 Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.

Can I always rely on RMW operations not being reordered - even if they affect different memory locations - and is there anything in the standard that guarantees this behaviour?

EDIT:

I came up with a simpler setup that should illustrate my question a little better. Here's the CppMem script for it:

int main() 
{
    atomic_int x = 0; atomic_int y = 0;
{{{
{
    if (cas_strong_explicit(&x, 0, 1, relaxed, relaxed))
    {
        cas_strong_explicit(&y, 0, 1, relaxed, relaxed);
    }
}
|||
{
    if (cas_strong_explicit(&x, 0, 2, relaxed, relaxed))
    {
        cas_strong_explicit(&y, 0, 2, relaxed, relaxed);
    }
}
|||
{
    // Is it possible for x and y to read 2 and 1, or 1 and 2?
    x.load(relaxed).readsvalue(2);
    y.load(relaxed).readsvalue(1);
}
}}}
  return 0; 
}

I don't think the tool is sophisticated enough to evaluate this scenario, though it does seem to indicate that it's possible. Here's the almost equivalent Relacy setup:

#include "relacy/relacy_std.hpp"

struct rmw_experiment : rl::test_suite<rmw_experiment, 3>
{
    rl::atomic<unsigned> x, y;

    void before()
    {
        x($) = y($) = 0;
    }

    void thread(unsigned tid)
    {
        if (tid == 0)
        {
            unsigned exp1 = 0;
            if (x($).compare_exchange_strong(exp1, 1, rl::mo_relaxed))
            {
                unsigned exp2 = 0;
                y($).compare_exchange_strong(exp2, 1, rl::mo_relaxed);
            }
        }
        else if (tid == 1)
        {
            unsigned exp1 = 0;
            if (x($).compare_exchange_strong(exp1, 2, rl::mo_relaxed))
            {
                unsigned exp2 = 0;
                y($).compare_exchange_strong(exp2, 2, rl::mo_relaxed);
            }
        }
        else
        {
            while (!(x($).load(rl::mo_relaxed) && y($).load(rl::mo_relaxed)));
            RL_ASSERT(x($) == y($));
        }
    }
};

int main()
{
    rl::simulate<rmw_experiment>();
}

The assertion is never violated, so 1 and 2 (or the reverse) is not possible according to Relacy.

2 Answers

2 votes

I haven't fully grokked your code yet, but the bolded question has a straightforward answer:

Can I always rely on RMW operations not being reordered - even if they affect different memory locations

No, you can't. Compile-time reordering of two relaxed RMWs in the same thread is very much allowed. (I think runtime reordering of two RMWs is probably impossible in practice on most CPUs. ISO C++ doesn't distinguish compile-time vs. run-time for this.)

But note that an atomic RMW includes both a load and a store, and both parts have to stay together. So any kind of RMW can't move earlier past an acquire operation, or later past a release operation.

Also, of course the RMW itself being a release and/or acquire operation can stop reordering in one or the other direction.


Of course, the C++ memory model isn't formally defined in terms of local reordering of access to cache-coherent shared memory, only in terms of synchronizing with another thread and creating a happens-before / after relationship. But if you ignore IRIW reordering (2 reader threads not agreeing on the order of two writer threads doing independent stores to different variables) it's pretty much 2 different ways to model the same thing.

2 votes

In your first example it is guaranteed that the flag.exchange is always executed after the counter.fetch_add, because && short-circuits: if the first expression evaluates to false, the second expression is never executed. The C++ standard guarantees this, so the compiler cannot reorder the two expressions (regardless of which memory order they use).

As Peter Cordes already explained, the C++ standard says nothing about whether or when instructions can be reordered with respect to atomic operations. In general, most compiler optimizations rely on the as-if rule:

The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine [..].

This provision is sometimes called the “as-if” rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the observable behavior of the program. For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side effects affecting the observable behavior of the program are produced.

The key aspect here is the "observable behavior". Suppose you have two relaxed atomic loads A and B on two different atomic objects, where A is sequenced before B.

std::atomic<int> x, y;

x.load(std::memory_order_relaxed); // A
y.load(std::memory_order_relaxed); // B

A sequenced-before relation is part of the definition of the happens-before relation, so one might assume that the two operations cannot be reordered. However, since both operations are relaxed, there is no guarantee about the "observable behavior": even with the original order, the x.load (A) could return a newer result than the y.load (B), so the compiler is free to reorder them, since the final program cannot tell the difference (i.e., the observable behavior is equivalent). If it were not equivalent, you would have a race condition! ;-)

To prevent such reorderings you have to rely on the (inter-thread) happens-before relation. If the x.load (A) used memory_order_acquire, then the compiler would have to assume that this operation synchronizes with some release operation, thus establishing an (inter-thread) happens-before relation. Suppose some other thread performs two atomic updates:

y.store(42, std::memory_order_relaxed); // C
x.store(1, std::memory_order_release); // D

If the acquire-load A sees the value stored by the store-release D, then the two operations synchronize with each other, thereby establishing a happens-before relation. Since y.store is sequenced before x.store, and x.load is sequenced before y.load, the transitivity of the happens-before relation guarantees that y.store happens-before y.load. Reordering the two loads or the two stores would destroy this guarantee and therefore change the observable behavior, so the compiler cannot perform such reorderings.

In general, arguing about possible reorderings is the wrong approach. As a first step you should always identify the happens-before relations you require (e.g., the y.store has to happen-before the y.load). The next step is to ensure that these happens-before relations are correctly established in all cases. At least that is how I approach correctness arguments for my implementations of lock-free algorithms.

Regarding Relacy: Relacy only simulates the memory model, but it relies on the order of operations as generated by the compiler. So even if a compiler could reorder two instructions, but chooses not to, you will not be able to identify this with Relacy.