Could this publish / check-for-update class for a single writer + reader use memory_order_relaxed or acquire/release for efficiency?

Question

Introduction

I have a small class which makes use of std::atomic for lock-free operation. Since this class is called very frequently, it is affecting performance and I'm having trouble.

Class description

The class is similar to a LIFO, but when pop() is called it returns only the last written element of its ring buffer (and only if there are new elements since the last pop()).

A single thread is calling push(), and another single thread is calling pop().

Source I've read

Since this is taking up too much CPU time, I decided to study std::atomic and its memory_order options a bit further. I've read many of the memory_order posts available on StackOverflow, as well as other sources and books, but I'm not able to get a clear idea of the different modes. In particular, I'm struggling with the acquire and release modes: I fail to see how they differ from memory_order_seq_cst.

What I think each memory order does, in my own words, from my own research

memory_order_relaxed: In the same thread, the atomic operations are instant, but other threads may fail to see the latest values instantly; they will need some time until they are updated. The code can be reordered freely by the compiler or OS.

memory_order_acquire / release: Used by atomic::load. It prevents the lines of code before it from being reordered (the compiler/OS may reorder everything after this line as much as it wants), and reads the latest value that was stored on this atomic using memory_order_release or memory_order_seq_cst, in this thread or another thread. memory_order_release also prevents the code after it from being reordered. So, with acquire/release, all the code between the two can be shuffled by the OS. I'm not sure whether that applies within the same thread or across different threads.

memory_order_seq_cst: Easiest to use because it's like the natural way of writing we are used to with variables, instantly refreshing the values seen by other threads' load functions.

The LockFreeEx class

#include <array>
#include <atomic>

template<typename T>
class LockFreeEx
{
public:
    void push(const T& element)
    {
        const int wPos = m_position.load(std::memory_order_seq_cst);
        const int nextPos = getNextPos(wPos);
        m_buffer[nextPos] = element;
        m_position.store(nextPos, std::memory_order_seq_cst);
    }

    const bool pop(T& returnedElement)
    {

        const int wPos = m_position.exchange(-1, std::memory_order_seq_cst);
        if (wPos != -1)
        {
            returnedElement = m_buffer[wPos]; 
            return true;
        }
        else
        {
            return false;
        }
    }

private:
    static constexpr int maxElements = 8;
    static constexpr int getNextPos(int pos) noexcept {return (++pos == maxElements)? 0 : pos;}
    std::array<T, maxElements> m_buffer;
    std::atomic<int> m_position {-1};
};

How I expect it could be improved

So, my first idea was to use memory_order_relaxed in all atomic operations. The pop() thread runs a loop that checks for available updates every 10-15 ms, so it's acceptable if the first few pop() calls miss a new update and a later call notices it. It's only a matter of milliseconds.

Another option would be using release/acquire, but I'm not sure about them: release in all store() calls and acquire in all load() calls.

Unfortunately, all the memory orders I described seem to work, and I'm not sure when they would fail, if they are supposed to fail.

Final

Please, could you tell me if you see any problem with using relaxed memory order here? Or should I use release/acquire (maybe a further explanation of these could help me)? Why?

I think that relaxed is the best choice for this class, in all its store() and load() calls. But I'm not sure!

Thanks for reading.

EDIT: EXTRA EXPLANATION:

Since I see everyone is asking about the 'char', I've changed it to int, problem solved! But that isn't the problem I want to solve.

The class, as I stated before, is something like a LIFO, but one where only the last element pushed matters, if there is any.

I have a big struct T (copyable and assignable) that I must share between two threads in a lock-free way. The only way I know to do it is using a circular buffer that holds the last known value of T, and an atomic that holds the index of the last value written. When there isn't any, the index is -1.

Notice that my push thread must know when there is a "new T" available; that's why pop() returns a bool.

Thanks again to everyone trying to assist me with memory orders! :)

AFTER READING SOLUTIONS:

template<typename T>
class LockFreeEx
{
public:
    LockFreeEx() {}
    LockFreeEx(const T& initValue): m_data(initValue) {}

    // WRITE THREAD - CAN BE SLOW, WILL BE CALLED EVERY 500-800ms
    void publish(const T& element)
    {
        // I used acquire instead of relaxed to make sure wPos is always the latest m_writePos value, so that nextPos is calculated correctly
        const int wPos = m_writePos.load(std::memory_order_acquire);
        const int nextPos = (wPos + 1) % bufferMaxSize;
        m_buffer[nextPos] = element;
        m_writePos.store(nextPos, std::memory_order_release);
    }


    // READ THREAD - NEEDS TO BE VERY FAST - CALLED ONCE AT THE BEGINNING OF THE LOOP every 2ms
    inline void update() 
    {
        // Should I change this to relaxed? It doesn't matter whether I get the new value or the old one, since I will call this function again very soon, and again, and again...
        const int writeIndex = m_writePos.load(std::memory_order_acquire); 
        // Updating only in case there is something new... T may be a heavy struct
        if (m_readPos != writeIndex)
        {
            m_readPos = writeIndex;
            m_data = m_buffer[m_readPos];
        }
    }
    // NEEDS TO BE LIGHTNING FAST, CALLED MULTIPLE TIMES IN THE READ THREAD
    inline const T& get() const noexcept {return m_data;}

private:
    // Buffer
    static constexpr int bufferMaxSize = 4;
    std::array<T, bufferMaxSize> m_buffer;

    std::atomic<int> m_writePos {0};
    int m_readPos = 0;

    // Data
    T m_data;
};

1) You should use exactly the necessary ordering for your algorithm. 2) Some CPUs are strongly ordered and always provide release semantics in store operations, when done in asm. 3) Just because the CPU provides a semantic doesn't imply the compiler will; you still need to say release when you mean release. — curiousguy
Yes, but that was before I knew that using smaller types doesn't improve performance. — Juan JuezSarmiento
char might be either signed or unsigned if you don't explicitly specify it. It actually doesn't matter in your code, but it is something to keep in mind. — G. Sliepen
@JuanJuezSarmiento: you might just want a SeqLock instead. It's hard to avoid C++ UB while getting a C++ compiler to generate safe machine code for any particular target, but fortunately this is one of the rare cases where "what the compiler doesn't know can't hurt it" because data-race UB isn't visible at compile time. See Implementing 64 bit atomic counter with 32 bit atomics and maybe also Optimal way to pass a few variables between 2 threads pinning different CPUs — Peter Cordes
What you've invented is like RCU but without the safety. Like in RCU, the read side is wait-free (never has to retry) and could be read-only if you weren't marking it as "done". But if the read side sleeps for some reason, the array entry it's using could be overwritten before the reader is finished reading it, leading to tearing. If you know that the chance of that happening is small enough (and the resulting error isn't catastrophic), given the known write speeds and other details of your use-case, then yeah this might be a good design. — Peter Cordes

ixSci · Accepted Answer · 2019-08-17T10:28:28

Memory order is not about when you see some particular change to an atomic object but rather about what this change can guarantee about the surrounding code. Relaxed atomics guarantee nothing except the change to the atomic object itself: the change will be atomic. But you can't use relaxed atomics in any synchronization context.
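To illustrate the kind of case where relaxed is enough, here is a minimal sketch of a shared event counter. The function name countWithRelaxed and its parameters are hypothetical and not part of the question's code. Nothing else is published through the counter, so only atomicity of the increment itself is needed.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical helper (not from the question): several threads bump one counter.
// Relaxed is sufficient because the counter is the only shared data;
// no other memory is handed over through it.
inline long countWithRelaxed(int threads, int incrementsPerThread)
{
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
    {
        workers.emplace_back([&counter, incrementsPerThread] {
            for (int i = 0; i < incrementsPerThread; ++i)
                counter.fetch_add(1, std::memory_order_relaxed); // atomic, but orders nothing else
        });
    }
    for (auto& w : workers)
        w.join(); // thread completion synchronizes with join(), so every increment is visible below
    return counter.load(std::memory_order_relaxed);
}
```

The total is exact because each fetch_add is an atomic read-modify-write. What relaxed does not give you is any promise about other variables written before or after the increment.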

And you have some code which requires synchronization. You want to pop something that has been pushed, and not try to pop something that has not been pushed yet. So if you use a relaxed operation, there is no guarantee that your pop will see this push code:

m_buffer[nextPos] = element;
m_position.store(nextPos, std::memory_order_relaxed);

as it is written. It can just as well see it this way:

m_position.store(nextPos, std::memory_order_relaxed);
m_buffer[nextPos] = element;

So you might try to get an element from the buffer which is not there yet. Hence, you have to use some synchronization and at least use acquire/release memory order.
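The release/acquire pairing described above can be sketched in isolation with a plain payload and an atomic flag. The names publishAndRead, payload and ready are made up for this illustration and do not appear in the question.

```cpp
#include <atomic>
#include <thread>

// Hypothetical demo (names are assumptions): publish a non-atomic payload
// through an atomic flag using a release store and an acquire load.
inline int publishAndRead()
{
    int payload = 0;                  // ordinary, non-atomic data
    std::atomic<bool> ready{false};   // the synchronization variable
    int seen = -1;

    std::thread writer([&] {
        payload = 42;                                  // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish: the write above cannot move below this store
    });
    std::thread reader([&] {
        while (!ready.load(std::memory_order_acquire)) // 3. wait until the release store is observed
            std::this_thread::yield();
        seen = payload;                                // 4. happens after step 1, so this reads 42
    });

    writer.join();
    reader.join();
    return seen;
}
```

If both operations on ready were relaxed instead, the read of payload in step 4 would race with the write in step 1, which is undefined behavior in C++, even though it may appear to work on strongly ordered hardware.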

As for your actual code, I think the orders can be as follows:

const int wPos = m_position.load(std::memory_order_relaxed);
...
m_position.store(nextPos, std::memory_order_release);
...
const int wPos = m_position.exchange(-1, std::memory_order_acquire);
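Put together, the question's first class with these orders applied might look like the sketch below. It is only an illustration of where each order goes. Two details are assumptions of this sketch and not part of the original: the buffer is value-initialized, and pop() returns plain bool.

```cpp
#include <array>
#include <atomic>

// Sketch of the question's LockFreeEx with the orders suggested above.
// Intended for exactly one push() thread and one pop() thread.
template<typename T>
class LockFreeEx
{
public:
    void push(const T& element)
    {
        // Relaxed: only the index value itself is needed here.
        const int wPos = m_position.load(std::memory_order_relaxed);
        const int nextPos = getNextPos(wPos);
        m_buffer[nextPos] = element;
        // Release: the buffer write above becomes visible before the new index.
        m_position.store(nextPos, std::memory_order_release);
    }

    bool pop(T& returnedElement)
    {
        // Acquire: pairs with the release store in push().
        const int wPos = m_position.exchange(-1, std::memory_order_acquire);
        if (wPos == -1)
            return false; // nothing new since the last pop()
        returnedElement = m_buffer[wPos];
        return true;
    }

private:
    static constexpr int maxElements = 8;
    static constexpr int getNextPos(int pos) noexcept { return (++pos == maxElements) ? 0 : pos; }
    std::array<T, maxElements> m_buffer{}; // value-initialized (assumption of this sketch)
    std::atomic<int> m_position{-1};
};
```

The orders only guarantee that a popped index refers to a fully written slot. They do not stop the writer from reusing a slot while the reader is still copying from it, which is the tearing risk raised in the comments.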