2 votes

I was running a bunch of algorithms through Relacy to verify their correctness and I stumbled onto something I didn't really understand. Here's a simplified version of it:

#include <thread>
#include <atomic>
#include <iostream>
#include <cassert> 

struct RMW_Ordering
{
    std::atomic<bool> flag {false};
    std::atomic<unsigned> done {0}, counter {0};
    unsigned race_cancel {0}, race_success {0}, sum {0};

    void thread1() // fail
    {
        race_cancel = 1; // data produced

        if (counter.fetch_add(1, std::memory_order_release) == 1 &&
            !flag.exchange(true, std::memory_order_relaxed))
        {
            counter.store(0, std::memory_order_relaxed);
            done.store(1, std::memory_order_relaxed);
        }
    }

    void thread2() // success
    {
        race_success = 1; // data produced

        if (counter.fetch_add(1, std::memory_order_release) == 1 &&
            !flag.exchange(true, std::memory_order_relaxed))
        {
            done.store(2, std::memory_order_relaxed);
        }
    }

    void thread3()
    {
        while (!done.load(std::memory_order_relaxed)); // livelock test
        counter.exchange(0, std::memory_order_acquire);
        sum = race_cancel + race_success;
    }
};

int main()
{
    for (unsigned i = 0; i < 1000; ++i)
    {
        RMW_Ordering test;

        std::thread t1([&]() { test.thread1(); });    
        std::thread t2([&]() { test.thread2(); });
        std::thread t3([&]() { test.thread3(); });

        t1.join();
        t2.join();
        t3.join();

        assert(test.counter == 0);
    }

    std::cout << "Done!" << std::endl;
}

Two threads race to enter a protected region, and the last one to enter modifies done, releasing a third thread from an infinite loop. The example is a bit contrived, but the original code needs to claim this region through the flag to signal "done".

Initially, the fetch_add had acq_rel ordering because I was concerned the exchange might get reordered before it, which could let one thread claim the flag, perform the fetch_add check first, and prevent the other thread (the one that gets past the increment check) from successfully modifying the schedule. While testing with Relacy, I decided to check whether the livelock I expected would actually occur if I switched from acq_rel to release, and to my surprise, it didn't. I then used relaxed for everything, and again, no livelock.

I tried to find any rules regarding this in the C++ standard but only managed to dig up these:

1.10.7 In addition, there are relaxed atomic operations, which are not synchronization operations, and atomic read-modify-write operations, which have special characteristics.

29.3.11 Atomic read-modify-write operations shall always read the last value (in the modification order) written before the write associated with the read-modify-write operation.

Can I always rely on RMW operations not being reordered - even if they affect different memory locations - and is there anything in the standard that guarantees this behaviour?

EDIT:

I came up with a simpler setup that should illustrate my question a little better. Here's the CppMem script for it:

int main() 
{
    atomic_int x = 0; atomic_int y = 0;
{{{
{
    if (cas_strong_explicit(&x, 0, 1, relaxed, relaxed))
    {
        cas_strong_explicit(&y, 0, 1, relaxed, relaxed);
    }
}
|||
{
    if (cas_strong_explicit(&x, 0, 2, relaxed, relaxed))
    {
        cas_strong_explicit(&y, 0, 2, relaxed, relaxed);
    }
}
|||
{
    // Is it possible for x and y to read 2 and 1, or 1 and 2?
    x.load(relaxed).readsvalue(2);
    y.load(relaxed).readsvalue(1);
}
}}}
  return 0; 
}

I don't think the tool is sophisticated enough to evaluate this scenario, though it does seem to indicate that it's possible. Here's the almost equivalent Relacy setup:

#include "relacy/relacy_std.hpp"

struct rmw_experiment : rl::test_suite<rmw_experiment, 3>
{
    rl::atomic<unsigned> x, y;

    void before()
    {
        x($) = y($) = 0;
    }

    void thread(unsigned tid)
    {
        if (tid == 0)
        {
            unsigned exp1 = 0;
            if (x($).compare_exchange_strong(exp1, 1, rl::mo_relaxed))
            {
                unsigned exp2 = 0;
                y($).compare_exchange_strong(exp2, 1, rl::mo_relaxed);
            }
        }
        else if (tid == 1)
        {
            unsigned exp1 = 0;
            if (x($).compare_exchange_strong(exp1, 2, rl::mo_relaxed))
            {
                unsigned exp2 = 0;
                y($).compare_exchange_strong(exp2, 2, rl::mo_relaxed);
            }
        }
        else
        {
            while (!(x($).load(rl::mo_relaxed) && y($).load(rl::mo_relaxed)));
            RL_ASSERT(x($) == y($));
        }
    }
};

int main()
{
    rl::simulate<rmw_experiment>();
}

The assertion is never violated, so 1 and 2 (or the reverse) is not possible according to Relacy.

2 Answers

2 votes

I haven't fully grokked your code yet, but the bolded question has a straightforward answer:

Can I always rely on RMW operations not being reordered - even if they affect different memory locations

No, you can't. Compile-time reordering of two relaxed RMWs in the same thread is very much allowed. (I think runtime reordering of two RMWs is probably impossible in practice on most CPUs. ISO C++ doesn't distinguish compile-time vs. run-time for this.)

But note that an atomic RMW includes both a load and a store, and both parts have to stay together. So any kind of RMW can't move earlier past an acquire operation, or later past a release operation.

Also, of course the RMW itself being a release and/or acquire operation can stop reordering in one or the other direction.


Of course, the C++ memory model isn't formally defined in terms of local reordering of access to cache-coherent shared memory, only in terms of synchronizing with another thread and creating a happens-before / after relationship. But if you ignore IRIW reordering (2 reader threads not agreeing on the order of two writer threads doing independent stores to different variables) it's pretty much 2 different ways to model the same thing.

2 votes

In your first example it is guaranteed that the flag.exchange is always executed after the counter.fetch_add, because && short-circuits: if the first expression evaluates to false, the second expression is never executed. The C++ standard guarantees this, so the compiler cannot reorder the two expressions (regardless of which memory order they use).

As Peter Cordes already explained, the C++ standard says nothing about whether or when instructions can be reordered with respect to atomic operations. In general, most compiler optimizations rely on the as-if rule:

The semantic descriptions in this International Standard define a parameterized nondeterministic abstract machine. This International Standard places no requirement on the structure of conforming implementations. In particular, they need not copy or emulate the structure of the abstract machine. Rather, conforming implementations are required to emulate (only) the observable behavior of the abstract machine [..].

This provision is sometimes called the “as-if” rule, because an implementation is free to disregard any requirement of this International Standard as long as the result is as if the requirement had been obeyed, as far as can be determined from the observable behavior of the program. For instance, an actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no side effects affecting the observable behavior of the program are produced.

The key aspect here is the "observable behavior". Suppose you have two relaxed atomic loads A and B on two different atomic objects, where A is sequenced before B.

std::atomic<int> x, y;

x.load(std::memory_order_relaxed); // A
y.load(std::memory_order_relaxed); // B

A sequenced-before relation is part of the definition of the happens-before relation, so one might assume that the two operations cannot be reordered. However, since both operations are relaxed, there is no guarantee about the "observable behavior": even with the original order, the x.load (A) could return a newer result than the y.load (B), so the compiler is free to reorder them, since the final program cannot tell the difference (i.e., the observable behavior is equivalent). If it were not equivalent, you would have a race condition! ;-)

To prevent such reorderings you have to rely on the (inter-thread) happens-before relation. If the x.load (A) used memory_order_acquire, then the compiler would have to assume that this operation synchronizes with some release operation, thus establishing an (inter-thread) happens-before relation. Suppose some other thread performs two atomic updates:

y.store(42, std::memory_order_relaxed); // C
x.store(1, std::memory_order_release); // D

If the acquire-load A sees the value stored by the store-release D, then the two operations synchronize with each other, thereby establishing a happens-before relation. Since y.store is sequenced before x.store, and x.load is sequenced before y.load, the transitivity of the happens-before relation guarantees that y.store happens-before y.load. Reordering the two loads or the two stores would destroy this guarantee and therefore change the observable behavior, so the compiler cannot perform such reorderings.

In general, arguing about possible reorderings is the wrong approach. As a first step you should always identify the happens-before relations you require (e.g., the y.store has to happen-before the y.load). The next step is to ensure that these happens-before relations are correctly established in all cases. At least that is how I approach correctness arguments for my implementations of lock-free algorithms.

Regarding Relacy: Relacy only simulates the memory model, but it relies on the order of operations as generated by the compiler. So even if a compiler could reorder two instructions, but chooses not to, you will not be able to identify this with Relacy.