Your writer only needs `release`, not `seq_cst`, but `relaxed` is too weak. You can't publish a value for `m_position` until after the non-atomic assignment to the corresponding `m_buffer[]` entry. You need release ordering to make sure the `m_position` store is visible to other threads only after all earlier memory operations (including the non-atomic assignment). https://preshing.com/20120913/acquire-and-release-semantics/

This has to "synchronize-with" an `acquire` or `seq_cst` load in the reader, or at least `mo_consume` in the reader.
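A minimal sketch of that pairing, using the names from the question's code (the `nextPos` / `elem` locals are illustrative):

```cpp
// writer: the element store must be ordered *before* the index is published
m_buffer[nextPos] = elem;                              // non-atomic store first
m_position.store(nextPos, std::memory_order_release);  // then publish the index

// reader: acquire synchronizes-with the writer's release store
int wPos = m_position.load(std::memory_order_acquire);
T value = m_buffer[wPos];  // guaranteed to see the fully-written element
```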
In theory you also need `wPos = m_position` to be at least `acquire` (or `consume` in the reader), not `relaxed`, because C++11's memory model is weak enough for things like value-prediction, which can let the compiler speculatively use a value for `wPos` before the load actually takes a value from coherent cache. (In practice on real CPUs, a crazy compiler could do this with test/branch to introduce a control dependency, allowing branch prediction + speculative execution to break the data dependency for a likely value of `wPos`.)

But normal compilers don't do that. On CPUs other than DEC Alpha, the data dependency in the source code of `wPos = m_position` and then using `m_buffer[wPos]` will create a data dependency in the asm, like `mo_consume` is supposed to take advantage of. Real ISAs other than Alpha guarantee dependency-ordering for dependent loads. (And even on Alpha, using a relaxed atomic exchange might be enough to close the tiny window that exists on the few real Alpha CPUs that allow this reordering.)
When compiling for x86, there's no downside at all to using `mo_acquire`; it doesn't cost any extra barriers. There can be on other ISAs, like 32-bit ARM where `acquire` costs a barrier, so "cheating" with a relaxed load could be a win that's still safe in practice. Current compilers always strengthen `mo_consume` to `mo_acquire`, so we unfortunately can't take advantage of it.
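To make that "cheat" concrete, here's a hedged sketch: a relaxed load whose result feeds the array index, relying on asm-level dependency ordering rather than any ISO C++ guarantee:

```cpp
// NOT guaranteed by ISO C++, but safe in practice on non-Alpha ISAs with
// current compilers: the indexing creates a data dependency in the asm.
int wPos = m_position.load(std::memory_order_relaxed); // "consume"-style cheat
T value = m_buffer[wPos];  // dependent load: ordered after the index load on real CPUs
```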
You already have a real-world race condition even using `seq_cst`.
- initial state: `m_position = 0`
- reader "claims" slot 0 by exchanging in `m_position = -1` and reads part of `m_buffer[0]`;
- reader sleeps for some reason (e.g. a timer interrupt deschedules it), or simply races with a writer.
- writer reads `wPos = m_position` as `-1`, and calculates `nextPos = 0`. It overwrites the partially-read `m_buffer[0]`.
- reader wakes up and finishes reading, getting a torn `T &element`. Data race UB in the C++ abstract machine, and tearing in practice.
Adding a 2nd check of `m_position` after the read (like a SeqLock) can't detect this in every case, because the writer doesn't update `m_position` until after writing the buffer element.
Even though your real use-case has long gaps between reads and writes, this defect can bite you with just one read and one write happening at almost the same time.
> I for sure know that the read side cannot wait and cannot be stopped (it's audio), and it's popped every 5-10 ms. The write side is the user input, which is much slower; a fast one could do a push once every 500 ms.
A millisecond is ages on a modern CPU. Inter-thread latency is often something like 60 ns (fractions of a microsecond), e.g. between cores of a quad-core Intel x86. As long as you don't sleep on a mutex, it's not a problem to spin-retry once or twice before giving up.
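For example, a hedged sketch of that retry idea (assuming some `read()` member like the ones discussed below, where `false` can mean a lost race with the writer):

```cpp
// Hypothetical wrapper: retry a couple of times before giving up, since a
// write in progress finishes within a fraction of a microsecond.
bool read_with_retry(T &elem, int tries = 3)
{
    while (tries-- > 0)
        if (read(elem))
            return true;
    return false;  // nothing new, or we kept losing races to the writer
}
```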
Code review:
> The class is similar to a LIFO, but once the pop() function is called, it only returns the last written element of its ring-buffer (only if there are new elements since the last pop()).
This isn't a real queue or stack: `push` and `pop` aren't great names. "publish" and "read" or "get" might be better, and would make it more obvious what this is for.
I'd include comments in the code to describe the fact that this is safe for a single writer, multiple readers. (The non-atomic increment of `m_position` in `push` makes it clearly unsafe for multiple writers.)
Even so, it's kinda weird even with 1 writer + 1 reader running at the same time. If a read starts while a write is in progress, it will get the "old" value instead of spin-waiting for a fraction of a microsecond to get the new value. Then the next time it reads, there will already be a new value waiting: the one it just missed seeing last time. So e.g. `m_position` can update in this order: 2, -1, 3.
That might or might not be desirable, depending on whether "stale" data has any value, and on the acceptability of the reader blocking if the writer sleeps mid-write (or, even without the writer sleeping, of spin-waiting).
The standard pattern for rarely written smallish data with multiple read-only readers is a SeqLock. e.g. for publishing a 128-bit current timestamp on a CPU that can't atomically read or write a 128-bit value. See Implementing 64 bit atomic counter with 32 bit atomics
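For reference, a minimal SeqLock sketch (my own illustration, not code from the question) for publishing a `T` that's too wide to be natively atomic:

```cpp
#include <atomic>

template <typename T>
class SeqLock {
    std::atomic<unsigned> m_seq {0};  // even = stable, odd = write in progress
    T m_data;                         // non-atomic payload
public:
    void write(const T &val)          // single writer only
    {
        unsigned seq = m_seq.load(std::memory_order_relaxed);
        m_seq.store(seq + 1, std::memory_order_relaxed);      // now odd
        std::atomic_thread_fence(std::memory_order_release);  // seq store before payload store
        m_data = val;
        m_seq.store(seq + 2, std::memory_order_release);      // even again: publish
    }
    bool read(T &out) const           // any number of readers; retry on false
    {
        unsigned seq0 = m_seq.load(std::memory_order_acquire);
        if (seq0 & 1)
            return false;                                     // write in progress
        out = m_data;                                         // may tear; checked below
        std::atomic_thread_fence(std::memory_order_acquire);  // payload loads before re-check
        return m_seq.load(std::memory_order_relaxed) == seq0; // unchanged => not torn
    }
};
```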
Possible design changes
To make this safe, we could let the writer run free, always wrapping around its circular buffer, and have the reader keep track of the last element it looked at.

If there's only one reader, this should be a simple non-atomic variable. If it's an instance variable, at least put it on the other side of `m_buffer[]` from the write-position.
```cpp
// Possible failure mode: writer wraps around between reads, leaving the same m_position
// single-reader
bool read(T &elem)
{
    // FIXME: big hack to get this in a separate cache line from the instance vars
    // maybe instead use alignas(64) int m_lastread as a class member, and/or
    // put it on the other side of m_buffer from m_position.
    static int lastread = -1;

    int wPos = m_position.load(std::memory_order_acquire); // or cheat with relaxed to get asm that's like "consume"
    if (lastread == wPos)
        return false;

    elem = m_buffer[wPos];
    lastread = wPos;
    return true;
}
```
You want `lastread` in a separate cache line from the stuff the writer writes. Otherwise the reader's updates of `lastread` will be slower because of false-sharing with the writer's writes, and vice versa.
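A hedged sketch of one way to lay that out (64-byte cache lines assumed; `m_lastread` as a class member is my naming):

```cpp
// Writer-owned line(s): position + buffer are only written by the writer.
alignas(64) std::atomic<int> m_position {-1};
T m_buffer[maxElements];

// Reader-owned line: written only by the (single) reader, so the writer's
// stores never invalidate it, and vice versa.
alignas(64) int m_lastread = -1;
```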
This lets the reader(s) be truly read-only wrt. the cache lines written by the writer. It will still take MESI traffic to request read access to lines that are in Modified state after the writer writes them, though. But the writer can still read `m_position` with no cache miss, so it can get its stores into the store buffer right away. It only has to wait for an RFO to get exclusive ownership of the cache line(s) before it can commit the element and the updated `m_position` from its store buffer to L1d cache.
TODO: let `m_position` increment without manual wrapping, so we have a write sequence number that takes a very long time to wrap around, avoiding false-negative early outs from `lastread == wPos`.

Use `wPos & (maxElements-1)` as the index, and `static_assert((maxElements & (maxElements-1)) == 0, "maxElements must be a power of 2");` (note the extra parens: `==` binds tighter than `&`).
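A sketch of that TODO, under the assumption that `m_position` becomes a `std::atomic<unsigned>` sequence counter (hypothetical; the rest of the class would need matching changes, and the reader masks the same way):

```cpp
static_assert((maxElements & (maxElements - 1)) == 0,
              "maxElements must be a power of 2");

void publish(const T &elem)
{
    // Monotonic sequence number: unsigned wraparound is well-defined and only
    // happens after 2^32 writes, so lastread == wPos false-negatives
    // become effectively impossible.
    unsigned seq = m_position.load(std::memory_order_relaxed) + 1;
    m_buffer[seq & (maxElements - 1)] = elem;
    m_position.store(seq, std::memory_order_release);
}
```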
Then the only danger is undetected tearing in a tiny time-window, if the writer has wrapped all the way around and is writing the element being read. For frequent reads and infrequent writes, and a buffer that's not too small, this should never happen. Checking `m_position` again after a read (like a SeqLock, similar to below) narrows the race window to only writes that are still in progress.
If there are multiple readers, another good option might be a `claimed` flag in each `m_buffer` entry. So you'd define:
```cpp
template<typename T, int maxElements>   // maxElements as a template param so the snippet is self-contained
class WaitFreePublish
{
private:
    struct {
        alignas(32) T elem;            // at most 2 elements per cache line
        std::atomic<int8_t> claimed;   // writer sets this to 0, readers try to CAS it to 1
        // could be bool if we don't end up needing 3 states for anything.
        // set to "1" in the constructor? or invert and call it "unclaimed"
    } m_buffer[maxElements];

    std::atomic<int> m_position {-1};
};
```
If `T` has padding at the end, it's a shame we can't take advantage of that for the `claimed` flag :/
This avoids the possible failure mode of comparing positions: if the writer wraps around between reads, the worst we get is tearing. And we could detect such tearing by having the writer clear the `claimed` flag first, before writing the rest of the element.
With no other threads writing `m_position`, we can definitely use a relaxed load without worry. We could even cache the write-position somewhere else, but the reader hopefully isn't invalidating the cache-line containing `m_position` very often. And apparently in your use-case, writer performance/latency probably isn't a big deal.
So the writer + reader could look like this, with SeqLock-style tearing detection using the known update-order for the claimed flag, the element, and `m_position`:
```cpp
/// claimed flag per array element supports concurrent readers
// thread-safety: single-writer only
// update claimed flag first, then the element, then m_position.
void publish(const T &elem)
{
    const int wPos = m_position.load(std::memory_order_relaxed);
    const int nextPos = getNextPos(wPos);

    m_buffer[nextPos].claimed.store(0, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);  // make sure the 0 is visible *before* the non-atomic element modification
    m_buffer[nextPos].elem = elem;

    m_position.store(nextPos, std::memory_order_release);
}
```
```cpp
// thread-safety: multiple readers are ok.  First one to claim an entry gets it.
// check claimed flag before/after to detect overwrite, like a SeqLock
bool read(T &elem)
{
    int rPos = m_position.load(std::memory_order_acquire);
    if (rPos < 0)
        return false;  // nothing published yet (m_position is still -1)

    int8_t claimed = m_buffer[rPos].claimed.load(std::memory_order_relaxed);
    if (claimed != 0)
        return false;  // read-only early-out

    claimed = 0;
    if (!m_buffer[rPos].claimed.compare_exchange_strong(
            claimed, 1, std::memory_order_acquire, std::memory_order_relaxed))
        return false;  // strong CAS failed: another thread claimed it

    elem = m_buffer[rPos].elem;

    // final check that the writer didn't step on this buffer during the read, like a SeqLock
    std::atomic_thread_fence(std::memory_order_acquire);  // LoadLoad barrier
    // We expect it to still be claimed == 1 like we set with CAS.
    // Otherwise we raced with a writer and elem may be torn.
    // Optionally retry once or twice in this case, because we know there's a new value waiting to be read.
    return m_buffer[rPos].claimed.load(std::memory_order_relaxed) == 1;
    // Note that elem can be updated even if we return false, if there was tearing.
    // Use a temporary if that's not ok.
}
```
Using `claimed = m_buffer[rPos].claimed.exchange(1)` and checking for `claimed == 0` would be another option, vs. CAS-strong. Maybe slightly more efficient on x86. On LL/SC machines, I guess CAS might be able to bail out without doing a write at all if it finds a mismatch with `expected`, in which case the read-only check is pointless.
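That variant would look something like this (a sketch, replacing only the claim step in `read()` above):

```cpp
// exchange always writes, but replaces the separate read-only early-out + CAS.
int8_t was = m_buffer[rPos].claimed.exchange(1, std::memory_order_acquire);
if (was != 0)
    return false;  // another reader claimed it, or nothing new to read
```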
I used `.claimed.compare_exchange_strong(claimed, 1)` with success ordering = `acquire` to make sure that the read of `claimed` happens-before reading `.elem`.
The "failure" memory ordering can be relaxed
: If we see it already claimed by another thread, we give up and don't look at any shared data.
The memory-ordering of the store part of `compare_exchange_strong` can be `relaxed`, so we just need `mo_acquire`, not `acq_rel`. Readers don't do any other stores to the shared data, and I don't think the ordering of the store matters wrt. the loads. CAS is an atomic RMW. Only one thread's CAS can succeed on a given buffer element, because they're all trying to set it from 0 to 1. That's how atomic RMWs work, regardless of being relaxed or seq_cst or anything in between.
It doesn't need to be seq_cst: we don't need to flush the store buffer or whatever to make sure the store is visible before this thread reads `.elem`. Just being an atomic RMW is enough to stop multiple threads from actually thinking they succeed. Release would just make sure it can't move earlier, ahead of the relaxed read-only check, but that wouldn't be a correctness problem. Hopefully no x86 compilers would do that at compile time. (At runtime on x86, RMW atomic operations are always seq_cst.)
I think being an RMW makes it impossible for it to "step on" a write from a writer (after wrapping around). But this might be a real-CPU implementation detail, not ISO C++. In the global modification order for any given `.claimed`, I think the RMW stays together, and the "acquire" ordering does keep it ahead of the read of the `.elem`. A `release` store that wasn't part of an RMW would be a potential problem, though: a writer could wrap around and put `claimed = 0` in a new entry, then the reader's store could eventually commit and set it to 1, when actually no reader has ever read that element.
If we're very sure the reader doesn't need to detect writer wrap-around of the circular buffer, leave out the `std::atomic_thread_fence` in the writer and reader. (The claimed flag and the non-atomic element store will still be ordered by the release-store to `m_position`.) The reader can be simplified to leave out the 2nd check, and always return true if it gets past the CAS.
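That simplified reader might look like this (a sketch under the no-wrap-around assumption, keeping the `rPos < 0` "nothing published yet" guard):

```cpp
bool read(T &elem)
{
    int rPos = m_position.load(std::memory_order_acquire);
    if (rPos < 0)
        return false;                  // nothing published yet

    int8_t expected = 0;
    if (!m_buffer[rPos].claimed.compare_exchange_strong(
            expected, 1, std::memory_order_acquire, std::memory_order_relaxed))
        return false;                  // another reader claimed this entry first

    elem = m_buffer[rPos].elem;        // no 2nd check: we trust that the writer
    return true;                       // hasn't wrapped around during our read
}
```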
Notice that `m_buffer[nextPos].claimed.store(0, std::memory_order_release);` would not be sufficient to stop later non-atomic stores from appearing before it: release-stores are a one-way barrier, unlike release fences. A release-fence is like a 2-way StoreStore barrier. (Free on x86, cheap on other ISAs.)
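In code, the difference in the writer (the first version is the broken one):

```cpp
// BROKEN: a release *store* only keeps *earlier* operations before it;
// the later non-atomic element store can reorder ahead of it.
m_buffer[nextPos].claimed.store(0, std::memory_order_release);
m_buffer[nextPos].elem = elem;   // may become visible before claimed == 0

// OK: a release *fence* also keeps the stores after it from reordering
// with the store before it.
m_buffer[nextPos].claimed.store(0, std::memory_order_relaxed);
std::atomic_thread_fence(std::memory_order_release);
m_buffer[nextPos].elem = elem;
```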
This SeqLock-style tearing detection doesn't technically avoid UB in the C++ abstract machine, unfortunately. There's no good / safe way to express this pattern in ISO C++, and it's known to be safe in asm on real hardware. Nothing actually uses the torn value (assuming `read()`'s caller ignores its `elem` value if it returns false).
Making `elem` a `std::atomic<T>` would defeat the entire purpose: that would use a spinlock to get atomicity, so you might as well use a spinlock directly.
Using `volatile T elem` would break `buffer[i].elem = elem`, because unlike C, C++ doesn't allow copying a volatile struct to/from a regular struct. (volatile struct = struct not possible, why?) This is highly annoying for a SeqLock type of pattern, where you'd like the compiler to emit efficient code to copy the whole object representation, optionally using SIMD vectors. You won't get that if you write a constructor or assignment operator that takes a `volatile T &` argument and copies individual members. So clearly `volatile` is the wrong tool, and that only leaves compiler memory barriers to make sure the non-atomic object is fully read or fully written before the barrier. `std::atomic_thread_fence` is, I think, actually safe for that, like `asm("" ::: "memory")` in GNU C. It works in practice on current compilers.
> `char` might be either signed or unsigned if you don't explicitly specify it. It actually doesn't matter in your code, but it is something to keep in mind. – G. Sliepen