What are the optimal settings for the NQFENCE and DQFENCE memory barriers in the following C++11 implementation of a lock-free bag?
Description: this algorithm is an (otherwise) near-optimal multi-producer multi-consumer lock-free queue intended for feeding a thread pool with non-null pointers. Its correctness is trivially obvious (modulo bugs!). It is not linearisable: data enqueued by a single thread may be dequeued out of order, so it is not suitable as a single-consumer queue (or for environments where consumers are not independent).
Surprisingly (to me at least!) it also appears to be close to optimal in general (for whatever that is worth). It does have some really nasty properties, such as the possibility of a writer thread failing to enqueue data indefinitely, and the possibility of all but one of the writer threads, running on different CPUs, stalling those CPUs indefinitely. Nevertheless, that property appears to be universal (true for all possible implementations). Ditto for readers.
Explanation: dequeue operations start at slot zero and swap a NULL pointer for the value there. If the result is non-NULL, we return it, having left a NULL in its place. Otherwise we swapped a NULL for a NULL, so we increment the slot index and retry after sleeping.
Enqueue operations also start at slot zero and swap the data to be stored with the value there. If the result is NULL, our work is done and we return. Otherwise we have displaced some other pointer by mistake, so we increment the index, sleep a bit, and retry, now trying to put the displaced value back into the queue.
We do not keep track of the head or tail locations because that would require extra ordering constraints and harm performance in uncontended operations. Under contention, extra time may be spent searching the array for a suitable slot, but that may even be desirable!
We could use approximate head and tail tracking: this would involve atomic reads and writes of the index positions with relaxed (that is, no) memory ordering. Atomicity is only required to ensure a whole value is written, never a torn one. These indices would not be accurate, but that does not affect the correctness of the algorithm. However, it is not entirely clear this modification would actually improve performance.
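To make the approximate-tracking idea concrete, here is a minimal sketch. The `tail_hint` member and the helper names are my invention, not part of the original code; the point is only that relaxed loads and stores never tear, so a stale hint costs extra probes but cannot break correctness.

```cpp
#include <atomic>
#include <cstddef>

// Sketch of the approximate-index variant described above (hypothetical
// names). Relaxed ordering guarantees whole-value reads/writes only.
struct bag_hints {
    std::size_t n;                          // capacity of the slot array
    std::atomic<std::size_t> tail_hint{0};  // roughly where the last enqueue landed

    // Where an enqueue should start probing. A stale hint merely costs
    // extra probes; it cannot violate correctness.
    std::size_t enqueue_start() const {
        return tail_hint.load(std::memory_order_relaxed) % n;
    }

    // Publish (approximately) where we last enqueued.
    void note_enqueue(std::size_t i) {
        tail_hint.store(i, std::memory_order_relaxed);
    }
};
```

An analogous `head_hint` would serve dequeuers; neither hint needs to be exact, which is why no stronger ordering is required.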
This algorithm is interesting because, unlike other more complicated algorithms, only a single atomic read-modify-write operation (an unconditional exchange rather than a CAS loop) is required per slot probe in each of the enqueue and dequeue methods.
#include <atomic>
#include <stdlib.h>
#include <thread>   // for this_thread::yield, used by SLEEP below

// NOTE: C++11 has no memory_order_cst; sequentially consistent
// ordering is spelled memory_order_seq_cst.
#define NQFENCE ::std::memory_order_seq_cst
#define DQFENCE ::std::memory_order_seq_cst

// Back-off between probes; yielding is just a placeholder policy.
#define SLEEP ::std::this_thread::yield()

struct bag {
  ::std::atomic<void *> volatile *a;
  size_t n;

  // Relies on all-zero bytes being a valid representation of an
  // atomic<void *> holding nullptr (true on common platforms, though
  // not guaranteed by the standard). Members are initialised in
  // declaration order: a, then n.
  bag (size_t n_) :
    a ((::std::atomic<void *> *) calloc (n_, sizeof (void *))),
    n (n_)
  {}

  void enqueue (void *d)
  {
    size_t i = 0;
    // Swap d into slot i; a non-null result means we displaced another
    // element and must re-enqueue it at a later slot.
    while ((d = ::std::atomic_exchange_explicit (a + i, d, NQFENCE)))
    {
      i = (i + 1) % n;
      SLEEP;
    }
  }

  void *dequeue ()
  {
    size_t i = 0;
    void *d = nullptr;
    // Swap a null into slot i; a non-null result is our element.
    while (!(d = ::std::atomic_exchange_explicit (a + i, d, DQFENCE)))
    {
      i = (i + 1) % n;
      SLEEP;
    }
    return d;
  }
};