3
votes

Introduction

I have a small class which make use of std::atomic for a lock free operation. Since this class is being called massively, it's affecting the performance and I'm having trouble.

Class description

The class similar to a LIFO, but once the pop() function is called, it only return the last written element of its ring-buffer (only if there are new elements since last pop()).

A single thread is calling push(), and another single thread is calling pop().

Source I've read

Since this is using too much time of my computer time, I decided to study a bit further the std::atomic class and its memory_order. I've read a lot of memory_order post avaliable in StackOverflow and other sources and books, but I'm not able to get a clear idea about the different modes. Specially, I'm struggling between acquire and release modes: I fail too see why they are different to memory_order_seq_cst.

What I think each memory order do using my words, from my own research

memory_order_relaxed: In the same thread, the atomic operations are instant, but other threads may fail to see the lastest values instantly, they will need some time until they are updated. The code can be re-ordered freely by the compiler or OS.

memory_order_acquire / release: Used by atomic::load. It prevents the lines of code there are before this from being reordered (the compiler/OS may reorder after this line all it want), and reads the lastest value that was stored on this atomic using memory_order_release or memory_order_seq_cst in this thread or another thread. memory_order_release also prevents that code after it may be reordered. So, in an acquire/release, all the code between both can be shuffled by the OS. I'm not sure if that's between same thread, or different threads.

memory_order_seq_cst: Easiest to use because it's like the natural writting we are used with variables, instantly refreshing the values of other threads load functions.

The LockFreeEx class

template<typename T>
class LockFreeEx
{
public:
    void push(const T& element)
    {
        const int wPos = m_position.load(std::memory_order_seq_cst);
        const int nextPos = getNextPos(wPos);
        m_buffer[nextPos] = element;
        m_position.store(nextPos, std::memory_order_seq_cst);
    }

    const bool pop(T& returnedElement)
    {

        const int wPos = m_position.exchange(-1, std::memory_order_seq_cst);
        if (wPos != -1)
        {
            returnedElement = m_buffer[wPos]; 
            return true;
        }
        else
        {
            return false;
        }
    }

private:
    static constexpr int maxElements = 8;
    static constexpr int getNextPos(int pos) noexcept {return (++pos == maxElements)? 0 : pos;}
    std::array<T, maxElements> m_buffer;
    std::atomic<int> m_position {-1};
};

How I expect it could be improved

So, my first idea was using memory_order_relaxed in all atomic operations, since the pop() thread is in a loop looking for avaliable updates in pop function each 10-15 ms, then it's allowed to fail in the firsts pop() functions to realize later that there is a new update. It's only a bunch of milliseconds.

Another option would be using release/acquire - but I'm not sure about them. Using release in all store() and acquire in all load() functions.

Unfortunately, all the memory_order I described seems to work, and I'm not sure when will they fail, if they are supposed to fail.

Final

Please, could you tell me if you see some problem using relaxed memory order here? Or should I use release/acquire (maybe a further explanation on these could help me)? why?

I think that relaxed is the best for this class, in all its store() or load(). But I'm not sure!

Thanks for reading.

EDIT: EXTRA EXPLANATION:

Since I see everyone is asking for the 'char', I've changed it to int, problem solved! But it doesn't it the one I want to solve.

The class, as I stated before, is something likely to a LIFO but where only matters the last element pushed, if there is any.

I have a big struct T (copiable and asignable), that I must share between two threads in a lock-free way. So, the only way I know to do it is using a circular buffer that writes the last known value for T, and a atomic which know the index of the last value written. When there isn't any, the index would be -1.

Notice that my push thread must know when there is a "new T" avaliable, that's why pop() returns a bool.

Thanks again to everyone trying to assist me with memory orders! :)

AFTER READING SOLUTIONS:

template<typename T>
class LockFreeEx
{
public:
    LockFreeEx() {}
    LockFreeEx(const T& initValue): m_data(initValue) {}

    // WRITE THREAD - CAN BE SLOW, WILL BE CALLED EACH 500-800ms
    void publish(const T& element)
    {
        // I used acquire instead relaxed to makesure wPos is always the lastest w_writePos value, and nextPos calculates the right one
        const int wPos = m_writePos.load(std::memory_order_acquire);
        const int nextPos = (wPos + 1) % bufferMaxSize;
        m_buffer[nextPos] = element;
        m_writePos.store(nextPos, std::memory_order_release);
    }


    // READ THREAD - NEED TO BE VERY FAST - CALLED ONCE AT THE BEGGINING OF THE LOOP each 2ms
    inline void update() 
    {
        // should I change to relaxed? It doesn't matter I don't get the new value or the old one, since I will call this function again very soon, and again, and again...
        const int writeIndex = m_writePos.load(std::memory_order_acquire); 
        // Updating only in case there is something new... T may be a heavy struct
        if (m_readPos != writeIndex)
        {
            m_readPos = writeIndex;
            m_data = m_buffer[m_readPos];
        }
    }
    // NEED TO BE LIGHTNING FAST, CALLED MULTIPLE TIMES IN THE READ THREAD
    inline const T& get() const noexcept {return m_data;}

private:
    // Buffer
    static constexpr int bufferMaxSize = 4;
    std::array<T, bufferMaxSize> m_buffer;

    std::atomic<int> m_writePos {0};
    int m_readPos = 0;

    // Data
    T m_data;
};
2
1) You should use exactly the necessary ordering for your algorithm. 2) Some CPU are strongly ordered and always provide release in store operations, when done in asm. 3) Just because the CPU provides a semantic doesn't imply the compiler will, you still need to say release when you mean release.curiousguy
Yes, but that was before I knew that using smaller types doesn't improve performance.Juan JuezSarmiento
char might be either signed or unsigned if you don't explicitly specify it. It actually doesn't matter in your code, but it is something to keep in mind.G. Sliepen
@JuanJuezSarmiento: you might just want a SeqLock instead. It's hard to avoid C++ UB while getting a C++ compiler to generate safe machine code for any particular target, but fortunately this is one of the rare cases where "what the compiler doesn't know can't hurt it" because data-race UB isn't visible at compile time. See Implementing 64 bit atomic counter with 32 bit atomics and maybe also Optimal way to pass a few variables between 2 threads pinning different CPUsPeter Cordes
What you've invented is like RCU but without the safety. Like in RCU, the read side is wait-free (never has to retry) and could be read-only if you weren't marking it as "done". But if the read side sleeps for some reasons, the array entry it's using could be overwritten before the reader is finished reading it, leading to tearing. If you know that the chance of that happening is small enough (and the resulting error isn't catastrophic), given the known write speeds and other details of your use-case, then yeah this might be a good design.Peter Cordes

2 Answers

5
votes

Memory order is not about when you see some particular change to an atomic object but rather about what this change can guarantee about the surrounding code. Relaxed atomics guarantee nothing except the change to the atomic object itself: the change will be atomic. But you can't use relaxed atomics in any synchronization context.

And you have some code which requires synchronization. You want to pop something that was pushed and not trying to pop what has not been pushed yet. So if you use a relaxed operation then there is no guarantee that your pop will see this push code:

m_buffer[nextPos] = element;
m_position.store(nextPos, std::memory_relaxed);

as it is written. It just as well can see it this way:

m_position.store(nextPos, std::memory_relaxed);
m_buffer[nextPos] = element;

So you might try to get an element from the buffer which is not there yet. Hence, you have to use some synchronization and at least use acquire/release memory order.


And to your actual code. I think the order can be as follows:

const char wPos = m_position.load(std::memory_order_relaxed);
...
m_position.store(nextPos, std::memory_order_release);
...
const char wPos = m_position.exchange(-1, memory_order_acquire);
3
votes

Your writer only needs release, not seq-cst, but relaxed is too weak. You can't publish a value for m_position until after the non-atomic assignment to the corresponding m_buffer[] entry. You need release ordering to make sure the m_position store is visible to other threads only after all earlier memory operations. (Including the non-atomic assignment). https://preshing.com/20120913/acquire-and-release-semantics/

This has to "synchronize-with" an acquire or seq_cst load in the reader. Or at least mo_consume in the reader.

In theory you also need wpos = m_position to be at least acquire (or consume in the reader), not relaxed, because C++11's memory model is weak enough for things like value-prediction which can let the compiler speculatively use a value for wPos before the load actually takes a value from coherent cache.

(In practice on real CPUs, a crazy compiler could do this with test/branch to introduce a control dependency, allowing branch prediction + speculative execution to break the data dependency for a likely value of wPos.)

But with normal compilers don't do that. On CPUs other than DEC Alpha, the data dependency in the source code of wPos = m_position and then using m_buffer[wPos] will create a data dependency in the asm, like mo_consume is supposed to take advantage of. Real ISAs other than Alpha guarantee dependency-ordering for dependent loads. (And even on Alpha, using a relaxed atomic exchange might be enough to close the tiny window that exists on the few real Alpha CPUs that allow this reordering.)

When compiling for x86, there's no downside at all to using mo_acquire; it doesn't cost any extra barriers. There can be on other ISAs, like 32-bit ARM where acquire costs a barrier, so "cheating" with a relaxed load could be a win that's still safe in practice. Current compilers always strengthen mo_consume to mo_acquire so we unfortunately can't take advantage of it.


You already have a real-word race condition even using seq_cst.

  • initial state: m_position = 0
  • reader "claims" slot 0 by exchanging in a m_position = -1 and reads part of m_buffer[0];
  • reader sleeps for some reason (e.g. timer interrupt deschedules it), or simply races with a writer.
  • writer reads wPos = m_position as -1, and calculates nextPos = 0.
    It overwrites the partially-read m_buffer[0]
  • reader wakes up and finishes reading, getting a torn T &element. Data race UB in the C++ abstract machine, and tearing in practice.

Adding a 2nd check of m_position after the read (like a SeqLock) can't detect this in every case because the writer doesn't update m_position until after writing the buffer element.

Even though you your real use-case has long gaps between reads and writes, this defect can bite you with just one read and write happening at almost the same time.

I for sure know that the read side cannot wait for nothing and cannot be stopped (it's audio) and it's poped each 5-10ms, and the write side is the user input, which is more slower, a faster one could do a push once each 500ms.

A millisecond is ages on a modern CPU. Inter-thread latency is often something like 60 ns, so fractions of a microsecond, e.g. from a quad-core Intel x86. As long as you don't sleep on a mutex, it's not a problem to spin-retry once or twice before giving up.


Code review:

The class similar to a LIFO, but once the pop() function is called, it only return the last written element of its ring-buffer (only if there are new elements since last pop()).

This isn't a real queue or stack: push and pop aren't great names. "publish" and "read" or "get" might be better and make it more obvious what this is for.

I'd include comments in the code to describe the fact that this is safe for a single writer, multiple readers. (The non-atomic increment of m_position in push makes it clearly unsafe for multiple writers.)

Even so, it's kinda weird even with 1 writer + 1 reader running at the same time. If a read starts while a write is in progress, it will get the "old" value instead of spin-waiting for a fraction of a microsecond to get the new value. Then next time it reads there will already be a new value waiting; the one it just missed seeing last time. So e.g. m_position can update in this order: 2, -1, 3.

That might or might not be desirable, depending on whether "stale" data has any value, and on acceptability of the reader blocking if the writer sleeps mid-write. Or even without the writer sleeping, of spin-waiting.

The standard pattern for rarely written smallish data with multiple read-only readers is a SeqLock. e.g. for publishing a 128-bit current timestamp on a CPU that can't atomically read or write a 128-bit value. See Implementing 64 bit atomic counter with 32 bit atomics


Possible design changes

To make this safe, we could let the writer run free, always wrapping around its circular buffer, and have the reader keep track of the last element it looked at.

If there's only one reader, this should be a simple non-atomic variable. If it's an instance variable, at least put it on the other side of m_buffer[] from the write-position.

    // Possible failure mode: writer wraps around between reads, leaving same m_position
    // single-reader
    const bool read(T &elem)
    {
        // FIXME: big hack to get this in a separate cache line from the instance vars
        // maybe instead use alignas(64) int m_lastread as a class member, and/or on the other side of m_buffer from m_position.
        static int lastread = -1;

        int wPos = m_position.load(std::memory_order_acquire);    // or cheat with relaxed to get asm that's like "consume"
        if (lastread == wPos)
            return false;

        elem = m_buffer[wPos];
        lastread = wPos;
        return true;
    }

You want lastread in a separate cache line from the stuff the writer writes. Otherwise the reader's updates of readPos will be slower because of false-sharing with the writer's writes and vice versa.

This lets the reader(s) be truly read-only wrt. the cache lines written by the writer. It will still take MESI traffic to request read access to lines that are in Modified state after the writer writes them, though. But the writer can still read m_position with no cache miss, so it can get its stores into the store buffer right away. It only has to wait for an RFO to get exclusive ownership of the cache line(s) before it can commit the element and the updated m_position from its store buffer to L1d cache.

TODO: let m_position increment without manual wrapping, so we have a write sequence number that takes a very long time to wrap around, avoiding false-negative early out from lastread == wPos.

Use wPos & (maxElements-1) as the index. And static_assert(maxElements & (maxElements-1) == 0, "maxElements must be a power of 2");

Then the only danger is undetected tearing in a tiny time-window if the writer has wrapped all the way around and is writing the element being read. For frequent reads and infrequent writes, and a buffer that's not too small, this should never happen. Checking the m_position again after a read (like a SeqLock, similar to below) narrows the race window to only writes that are still in progress.


If there are multiple readers, another good option might be a claimed flag in each m_buffer entry. So you'd define

template<typename T>
class WaitFreePublish
{

private:
    struct {
        alignas(32) T elem;           // at most 2 elements per cache line
        std::atomic<int8_t> claimed;  // writers sets this to 0, readers try to CAS it to 1
                                      // could be bool if we don't end up needing 3 states for anything.
                                      // set to "1" in the constructor?  or invert and call it "unclaimed"
    } m_buffer[maxElements];

    std::atomic<int> m_position {-1};
}

If T has padding at the end, it's a shame we can't take advantage of that for the claimed flag :/

This avoids the possible failure mode of comparing positions: if the writer wraps around between reads, the worst we get is tearing. And we could detect such tearing by having the writer clear the claimed flag first, before writing the rest of the element.

With no other threads writing m_position, we can definitely use a relaxed load without worry. We could even cache the write-position somewhere else, but the reader hopefully isn't invalidating the cache-line containing m_position very often. And apparently in your use-case, writer performance/latency probably isn't a big deal.

So the writer + reader could look like this, with SeqLock-style tearing detection using the known update-order for claimed flag, element, and m_position.

/// claimed flag per array element supports concurrent readers

    // thread-safety: single-writer only
    // update claimed flag first, then element, then m_position.
    void publish(const T& elem)
    {
        const int wPos = m_position.load(std::memory_order_relaxed);
        const int nextPos = getNextPos(wPos);

        m_buffer[nextPos].claimed.store(0, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_release);  // make sure that `0` is visible *before* the non-atomic element modification
        m_buffer[nextPos].elem = elem;

        m_position.store(nextPos, std::memory_order_release);
    }

    // thread-safety: multiple readers are ok.  First one to claim an entry gets it
    // check claimed flag before/after to detect overwrite, like a SeqLock
    const bool read(T &elem)
    {
        int rPos = m_position.load(std::memory_order_acquire);

        int8_t claimed = m_buffer[rPos].claimed.load(std::memory_order_relaxed);
        if (claimed != 0)
            return false;      // read-only early-out

        claimed = 0;
        if (!m_buffer[rPos].claimed.compare_exchange_strong(
                claimed, 1, std::memory_order_acquire, std::memory_order_relaxed))
            return false;  // strong CAS failed: another thread claimed it

        elem = m_buffer[rPos].elem;

        // final check that the writer didn't step on this buffer during read, like a SeqLock
        std::atomic_thread_fence(std::memory_order_acquire);    // LoadLoad barrier

        // We expect it to still be claimed=1 like we set with CAS
        // Otherwise we raced with a writer and elem may be torn.
        //  optionally retry once or twice in this case because we know there's a new value waiting to be read.
        return m_buffer[rPos].claimed.load(std::memory_order_relaxed) == 1;

        // Note that elem can be updated even if we return false, if there was tearing.  Use a temporary if that's not ok.
    }

Using claimed = m_buffer[rPos].exchange(1) and checking for claimed==0 would be another option, vs. CAS-strong. Maybe slightly more efficient on x86. On LL/SC machines I guess CAS might be able to bail out without doing a write at all if it finds a mismatch with expected, in which case the read-only check is pointless.

I used .claimed.compare_exchange_strong(claimed, 1) with success ordering = acquire to make sure that read of claimed happens-before reading .elem.

The "failure" memory ordering can be relaxed: If we see it already claimed by another thread, we give up and don't look at any shared data.

The memory-ordering of the store part of compare_exchange_strong can be relaxed, so we just need mo_acquire, not acq_rel. Readers don't do any other stores to the shared data, and I don't think the ordering of the store matters wrt. to the loads. CAS is an atomic RMW. Only one thread's CAS can succeed on a given buffer element because they're all trying to set it from 0 to 1. That's how atomic RMWs work, regardless of being relaxed or seq_cst or anything in between.

It doesn't need to be seq_cst: we don't need to flush the store buffer or whatever to make sure the store is visible before this thread reads .elem. Just being an atomic RMW is enough to stop multiple threads from actually thinking they succeed. Release would just make sure it can't move earlier, ahead of the relaxed read-only check. That wouldn't be a correctness problem. Hopefully no x86 compilers would do that at compile time. (At runtime on x86, RMW atomic operations are always seq_cst.)

I think being an RMW makes it impossible for it to "step on" a write from a writer (after wrapping around). But this might be real-CPU implementation detail, not ISO C++. In the global modification order for any given .claimed, I think the RMW stays together, and the "acquire" ordering does keep it ahead of the read of the .elem. A release store that wasn't part of a RMW would be a potential problem though: a writer could wrap around and put claimed=0 in a new entry, then the reader's store could eventually commit and set it to 1, when actually no reader has ever read that element.


If we're very sure the reader doesn't need to detect writer wrap-around of the circular buffer, leave out the std::atomic_thread_fence in the writer and reader. (The claimed and the non-atomic element store will still be ordered by the release-store to m_position). The reader can be simplified to leave out the 2nd check and always return true if it gets past the CAS.

Notice that m_buffer[nextPos].claimed.store(0, std::memory_order_release); would not be sufficient to stop later non-atomic stores from appearing before it: release-stores are a one-way barrier, unlike release fences. A release-fence is like a 2-way StoreStore barrier. (Free on x86, cheap on other ISAs.)

This SeqLock-style tearing detection doesn't technically avoid UB in the C++ abstract machine, unfortunately. There's no good / safe way to express this pattern in ISO C++, and it's known to be safe in asm on real hardware. Nothing actually uses the torn value (assuming read()'s caller ignores its elem value if it returns false).

Making elem a std::atomic<T> would be defeat the entire purpose: that would use a spinlock to get atomicity so it might as well use it directly.

Using volatile T elem would break buffer[i].elem = elem because unlike C, C++ doesn't allow copying a volatile struct to/from a regular struct. (volatile struct = struct not possible, why?). This is highly annoying for a SeqLock type of pattern where you'd like the compiler to emit efficient code to copy the whole object representation, optionally using SIMD vectors. You won't get that if you write a constructor or assignment operator that takes a volatile &T argument and does individual members. So clearly volatile is the wrong tool, and that only leaves compiler memory barriers to make sure the non-atomic object is fully read or fully written before the barrier. std::atomic_thread_fence is I think actually safe for that, like asm("" ::: "memory") in GNU C. It works in practice on current compilers.