1 vote

I'm trying to optimise for consumer latency in an SPSC queue like this:

template <typename TYPE>
class queue
{
public:

    void produce(message m)
    {
        const auto lock = std::scoped_lock(mutex);
        has_new_messages = true;
        new_messages.emplace_back(std::move(m));
    }

    void consume()
    {
        if (UNLIKELY(has_new_messages))
        {
            const auto lock = std::scoped_lock(mutex);
            has_new_messages = false;
            messages_to_process.insert(
                messages_to_process.cend(),
                std::make_move_iterator(new_messages.begin()),
                std::make_move_iterator(new_messages.end()));
            new_messages.clear();
        }

        // handle messages_to_process, and then...

        messages_to_process.clear();
    }

private:
    TYPE has_new_messages{false};
    std::vector<message> new_messages{};
    std::vector<message> messages_to_process{};

    std::mutex mutex;
};

The consumer here tries to avoid paying for locking/unlocking the mutex when possible, so it checks the flag before taking the lock.

The question is: do I absolutely have to use TYPE = std::atomic<bool>, or can I save on atomic operations because reading a volatile bool is fine?

It's known that a volatile variable per se doesn't guarantee thread safety; however, std::mutex::lock() and std::mutex::unlock() provide some memory ordering guarantees. Can I rely on them to make changes to a volatile bool has_new_messages eventually visible to the consumer thread outside of the mutex scope?


Update: Following @Peter Cordes' advice, I'm rewriting this as follows:

    void produce(message m)
    {
        {
            const auto lock = std::scoped_lock(mutex);
            new_messages.emplace_back(std::move(m));
        }
        has_new_messages.store(true, std::memory_order_release);
    }

    void consume()
    {
        if (UNLIKELY(has_new_messages.exchange(false, std::memory_order_acq_rel)))
        {
            const auto lock = std::scoped_lock(mutex);
            messages_to_process.insert(...);
            new_messages.clear();
        }
    }
1
You can rely on mutex lock/unlock and you do not need volatile for that, but not outside of scope – Slava
Why is has_new_messages using the template type? Shouldn't it just be bool? Or is that so you can do a bool vs. volatile bool version? – Peter Cordes
@PeterCordes, yeah, just for demonstration purposes. It's for bool vs volatile bool vs std::atomic<bool>. – Dev Null
Spinning on xchg is sub-optimal vs. read-only with pause, I think. Maybe if(exchange) { ... } else { _mm_pause(); } would be ok, too. Normally you want to spin read-only, and only xchg if your read-only check says it should work. But exchange seems to have no advantage vs. separate read / write, if no other thread can consume it out from under you. – Peter Cordes
It seems to me that a better lock / semaphore should be able to do the whole job, instead of having cache lines bouncing between cores for access to has_new_messages as well as the mutex. I don't know if some kind of counter-based lock could do the trick, though. – Peter Cordes

1 Answer

3 votes

It can't be a plain bool. Your spin-loop in the reader will optimize into something like if (!has_new_messages) infinite_loop;, because the compiler can hoist the load out of the loop: it's allowed to assume the variable doesn't change asynchronously.


volatile works on some platforms (including most mainstream CPUs such as x86-64 or ARM) as a crappy alternative to atomic loads/stores with memory_order_relaxed, for types that are "naturally" atomic (e.g. int or bool, because the ABI gives them natural alignment). i.e. where lock-free atomic load/store uses the same asm as normal load/store.

I recently wrote an answer comparing volatile with relaxed atomic for an interrupt handler, but it's basically the same for actually concurrent threads. has_new_messages.load(std::memory_order_relaxed) compiles to the same asm you'd get from volatile on normal platforms (i.e. no extra fencing instructions, just a plain load or store), but it's legal / portable C++.
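A minimal sketch of the pattern being discussed, with int standing in for the message type and hypothetical names produce/try_consume. The assumption here is that the mutex itself publishes the data (lock/unlock act as acquire/release), so the flag only needs relaxed ordering:

```cpp
#include <atomic>
#include <mutex>
#include <vector>

// Flag only needs relaxed ordering here: the consumer re-checks under the
// mutex, and the mutex's lock/unlock order the actual data accesses.
std::atomic<bool> has_new_messages{false};
std::mutex mtx;
std::vector<int> new_messages;

void produce(int m)
{
    {
        std::scoped_lock lock(mtx);
        new_messages.push_back(m);
    }
    // Store after releasing the mutex so the consumer doesn't wake up
    // and immediately block on a still-held lock.
    has_new_messages.store(true, std::memory_order_relaxed);
}

bool try_consume(std::vector<int>& out)
{
    if (!has_new_messages.exchange(false, std::memory_order_relaxed))
        return false;                       // cheap check, no mutex touched
    std::scoped_lock lock(mtx);             // lock acquisition orders the data reads
    out.insert(out.end(), new_messages.begin(), new_messages.end());
    new_messages.clear();
    return true;
}
```

If the flag were the only thing publishing the data (no mutex), the store would need memory_order_release and the read side an acquire, as in the question's update.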

You can and should just use std::atomic<bool> has_new_messages; with mo_relaxed loads/stores outside the mutex, if doing the same thing with volatile would have been safe.

Your writer should probably set the flag after releasing the mutex, or maybe use a memory_order_release store at the end of the critical section. There's no point having the reader break out of the spin loop and try to take the mutex when the writer hasn't actually released it yet.

BTW, if your reader thread is spinning on has_new_messages waiting for it to become true, you should use _mm_pause() in your loop on x86 to save power and avoid a memory-order mis-speculation pipeline clear when it does change. Also consider falling back to an OS-assisted sleep/wake after spinning a couple thousand times. See What does __asm volatile ("pause" ::: "memory"); do?, and for more about memory written by one thread and read by another, see What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings? (including some memory-order mis-speculation results.)
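A sketch of such a wait loop, with a hypothetical cpu_relax helper wrapping _mm_pause on x86 and a yield fallback elsewhere; the spin count of 4000 is an arbitrary illustrative threshold, not a tuned value:

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>   // _mm_pause
#endif

// Hint to the CPU that we're in a spin-wait: saves power and avoids the
// memory-order mis-speculation pipeline clear when the flag finally changes.
inline void cpu_relax()
{
#if defined(__x86_64__) || defined(_M_X64)
    _mm_pause();
#else
    std::this_thread::yield();
#endif
}

inline void wait_for_flag(std::atomic<bool>& flag)
{
    for (int spins = 0; ; ++spins)
    {
        if (flag.load(std::memory_order_acquire))   // read-only check: no cache-line ping-pong
            return;
        if (spins < 4000)
            cpu_relax();                            // short on-core pause while spinning
        else
            std::this_thread::yield();              // after a while, give up the timeslice
    }
}
```

A production version would replace the yield fallback with an OS-assisted sleep/wake (futex, std::atomic::wait in C++20, or a condition variable).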


Or better, use a lockless SPSC queue; there are lots of implementations using a fixed-size ring buffer where there's no contention between reader and writer if the queue isn't full or empty. If you arrange things so the atomic position counters for the reader and writer are in separate cache lines, it should be good.
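A minimal sketch of that idea (hypothetical name spsc_ring, capacity fixed at a power of two, cache-line size assumed to be 64 bytes). The producer only writes tail and the consumer only writes head, so with the counters on separate cache lines the two threads don't contend unless the queue is full or empty:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Single-producer single-consumer ring buffer. Indices grow without bound;
// masking with N-1 maps them into the buffer (hence N must be a power of two).
template <typename T, std::size_t N>
class spsc_ring
{
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");
public:
    bool push(const T& v)   // call from the producer thread only
    {
        const auto t = tail.load(std::memory_order_relaxed);
        if (t - head.load(std::memory_order_acquire) == N)
            return false;                               // queue full
        buf[t & (N - 1)] = v;
        tail.store(t + 1, std::memory_order_release);   // publish the slot
        return true;
    }

    bool pop(T& v)          // call from the consumer thread only
    {
        const auto h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false;                               // queue empty
        v = buf[h & (N - 1)];
        head.store(h + 1, std::memory_order_release);   // release the slot
        return true;
    }

private:
    std::array<T, N> buf{};
    alignas(64) std::atomic<std::size_t> head{0};   // written only by consumer
    alignas(64) std::atomic<std::size_t> tail{0};   // written only by producer
};
```

Each side loads its own counter relaxed (it is the only writer of it) and the other side's counter with acquire, pairing with the release store that published or freed the slot.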


changes to volatile bool has_new_messages to be eventually visible to the consumer thread

This is a common misconception. Any store will very quickly become visible to all other CPU cores because they all share a coherent cache domain, and stores are committed to it as quickly as possible without any need for fencing instructions.

If I don't use fences, how long could it take a core to see another core's writes? Worst case is probably about a microsecond, within an order of magnitude. Normally less.

And volatile or atomic ensure that there will actually be a store in the compiler-generated asm.

(Related: Current compilers basically don't optimize atomic<T> at all; so atomic is basically equivalent to volatile atomic. Why don't compilers merge redundant std::atomic writes?. But even without that, the compiler couldn't skip doing the store or hoist the load out of a spin loop.)