What are the optimal settings for the NQFENCE and DQFENCE memory barriers in the following C++11 implementation of a lock-free bag?
Description: this algorithm is an (otherwise) near-optimal multi-producer multi-consumer lock-free queue intended for feeding a thread pool with non-null pointers. Its correctness is trivially obvious (modulo bugs!). It is not linearisable: data enqueued by a single thread may be dequeued out of order, so it is not suitable as a single-consumer queue (or for environments where consumers are not independent).
Surprisingly (to me at least!) it also appears to be close to optimal in general (for whatever that is worth). It does have some really nasty properties, such as the possibility of a writer thread failing to enqueue data indefinitely, and the possibility of all but one of the writer threads, running on different CPUs, stalling those CPUs indefinitely. Nevertheless, that property appears to be universal (true for all possible implementations). Ditto for readers.
Explanation: dequeue operations start at slot zero and swap a NULL pointer for the value there. If the result is non-NULL, we return it, having left a NULL in its place. Otherwise we swapped a NULL for a NULL, so we increment the slot index and retry after sleeping.
Enqueue operations also start at slot zero and swap the data to be stored with the value there. If the result is NULL, our work is done and we return. Otherwise we have displaced some other pointer by mistake, so we increment the index, sleep a bit, and retry, now trying to put the displaced value back into the queue.
We do not keep track of the head or tail locations because that would require extra ordering constraints and harm performance in uncontended operations. Under contention, extra time may be spent searching the array for a suitable slot, but that may even be desirable!
We could use approximate head and tail tracking: this would involve atomic reads and writes of the index positions with relaxed (that is, no) memory ordering. Atomicity is only required to ensure a whole value is written, never a torn one. These indices would not be accurate, but that does not affect the correctness of the algorithm. However, it is not entirely clear this modification would actually improve performance.
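To make the approximate-tracking idea concrete, here is a minimal sketch. The `tail_hint` member and the helper names are my invention, not part of the original code; the point is only that relaxed loads and stores never tear, so a stale hint costs extra probes but cannot break correctness.

```cpp
#include <atomic>
#include <cstddef>

// Sketch of the approximate-index variant described above (hypothetical
// names). Relaxed ordering guarantees whole-value reads/writes only.
struct bag_hints {
    std::size_t n;                          // capacity of the slot array
    std::atomic<std::size_t> tail_hint{0};  // roughly where the last enqueue landed

    // Where an enqueue should start probing. A stale hint merely costs
    // extra probes; it cannot violate correctness.
    std::size_t enqueue_start() const {
        return tail_hint.load(std::memory_order_relaxed) % n;
    }

    // Publish (approximately) where we last enqueued.
    void note_enqueue(std::size_t i) {
        tail_hint.store(i, std::memory_order_relaxed);
    }
};
```

An analogous `head_hint` would serve dequeuers; neither hint needs to be exact, which is why no stronger ordering is required.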
This algorithm is interesting because, unlike other more complicated algorithms, only a single atomic read-modify-write operation (an unconditional exchange rather than a CAS loop) is required per slot probe in each of the enqueue and dequeue methods.
#include <atomic>
#include <stdlib.h>
#include <thread>   // for this_thread::yield, used by SLEEP below

// NOTE: C++11 has no memory_order_cst; sequentially consistent
// ordering is spelled memory_order_seq_cst.
#define NQFENCE ::std::memory_order_seq_cst
#define DQFENCE ::std::memory_order_seq_cst

// Back-off between probes; yielding is just a placeholder policy.
#define SLEEP ::std::this_thread::yield()

struct bag {
  ::std::atomic<void *> volatile *a;
  size_t n;

  // Relies on all-zero bytes being a valid representation of an
  // atomic<void *> holding nullptr (true on common platforms, though
  // not guaranteed by the standard). Members are initialised in
  // declaration order: a, then n.
  bag (size_t n_) :
    a ((::std::atomic<void *> *) calloc (n_, sizeof (void *))),
    n (n_)
  {}

  void enqueue (void *d)
  {
    size_t i = 0;
    // Swap d into slot i; a non-null result means we displaced another
    // element and must re-enqueue it at a later slot.
    while ((d = ::std::atomic_exchange_explicit (a + i, d, NQFENCE)))
    {
      i = (i + 1) % n;
      SLEEP;
    }
  }

  void *dequeue ()
  {
    size_t i = 0;
    void *d = nullptr;
    // Swap a null into slot i; a non-null result is our element.
    while (!(d = ::std::atomic_exchange_explicit (a + i, d, DQFENCE)))
    {
      i = (i + 1) % n;
      SLEEP;
    }
    return d;
  }
};