84
votes

I ve been googling quite a bit for a lock-free queue in C++. I found some code and some trials - but nothing that i was able to compile. A lock-free hash would also be welcome.

SUMMARY: So far i have no positive answer. There is no "production ready" library, and amazingly none of the existent libraries complies to the API of STL containers.

15
Visual Studio 2010 contains a lock free queue in <concurrent_queue.h>Rick
And there is a hash_map and unordered_map at code.msdn.com/concrtextrasRick
Note that, curiously, the term "lock-free" does not necessarily mean that there are no locks. See en.wikipedia.org/wiki/Non-blocking_algorithm for one definition.Kristopher Johnson
Wow a question asking how to solve a common yet difficult problem in multithreaded programming that has multiple solutions, generated a lot of discussion, and earned a ton of upvotes... Then 9 years later you close it as off-topic. Thanks for your contribution to StackOverflow, NathanOliver, Sir E_net4 the Wise Downvoter, Jean-François Fabre, Machavity, and gre_gor /smuusbolla
I would say that the people who closed the question probably don't understand it.Derf Skren

15 Answers

40
votes

As of 1.53, boost provides a set of lock free data structures, including queues, stacks and single-producer/single-consumer queues (i.e. ring buffers).

24
votes

The starting point would be either of Herb Sutter's DDJ articles for either a single producer and consumer or multiple ones. The code he gives (in-line starting on the second page of each article) uses the C++0x style atomic<T> template type; which you can imitate using the Boost interprocess library.

The boost code is buried in the depths of the interprocess library, but having read through the appropriate header file (atomic.hpp) the implementations for the necessary compare-and-swap operations on the systems I am familiar with look sound.

17
votes

Yes!

I wrote a lock-free queue. It has Features™:

  • Fully wait-free (no CAS loops)
  • Super fast (over a hundred million enqueue/dequeue operations per second)
  • Uses C++11 move semantics
  • Grows as needed (but only if you want it to)
  • Does lock-free memory management for the elements (using pre-allocated contiguous blocks)
  • Stand-alone (two headers plus a license and readme)
  • Compiles under MSVC2010+, Intel ICC 13, and GCC 4.7.2 (and should work under any C++11 fully-compliant compiler)

It's available on GitHub under the simplified BSD license (feel free to fork it!).

Caveats:

  • Only for single-producer single-consumer architecture (i.e. two threads)
  • Thoroughly tested on x86(-64), and should work on ARM, PowerPC, and other CPUs where aligned native-sized integers and pointer loads and stores are naturally atomic, but has not been field tested on non-x86 CPUs (if someone has one to test it on let me know)
  • No idea if any patents are infringed upon (use at your own risk, etc.). Note that I did design and implement it myself from scratch.
16
votes

Facebook's Folly seems to have lock free data structures based on C++11 <atomic>:

I would dare to say these are currently used in production, so I guess they could safely be used in other projects.

Cheers!

11
votes

There is such a library, but it's in C.

Wrapping to C++ should be straightforward.

liblfds

10
votes

boost.lockfree is an attempt to create c++ implementations of lockfree stack and fifo classes.

public git repository

9
votes

After having checked most of the given answers, i can only state:

The answer is NO.

There is no such thing right that could be used right out of the box.

6
votes

The closest thing I am aware of is Windows Interlocked Singly Linked Lists. Of course, it is Windows only.

5
votes

If you have a multiple-producer / single-consumer Queue/FIFO, you can easily make one LockFree using SLIST or a trivial Lock Free LIFO stack. What you do is have a second "private" stack for the consumer (which can also be done as a SLIST for simplicity or any other stack model you choose). The consumer pops items off the private stack. Whenever the private LIFO is exhasted, you do a Flush rather than Pop off the shared concurrent SLIST (grabbing the entire SLIST chain) and then walk the Flushed list in-order pushing items onto the private stack.

That works for single-producer / single-consumer and for multiple-producer / single-consumer.

However, it does not work for multiple-consumer cases (with either single-producer or multiple-producers).

Also, as far as hash tables go, they are an ideal candidate for "striping" which is just dividing the hash into segments having a lock per segments of the cache. This is how the Java concurrent library does it (using 32-stripes). If you have a light-weight reader-writer lock, the hash table can be concurrently accessed for simultaneous reads and you will only stall when write are occurring on contested stripes (and possibly if you allow for growing the hash table).

If you roll your own, make sure to interleave your locks with the hash entries rather than put all your locks in an array next to each other so you're less likely to have false sharing.

4
votes

I may come a bit late on this.

The absence of solutions (at the question was asked) are mainly due to an important issue in C++ (before C++0x/11): C++ have (has) no concurrent memory model.

Now, using std::atomic, you can control memory ordering issues and have proper compare-and-swap operations. I've written myself an implementation of Micheal&Scott's lock-free queue (PODC96) using C++11 and Micheal's Hazard Pointers (IEEE TPDS 2004) to avoid early free and ABA problems. It's working fine but it's a quick and dirty implementation and I'm not satisfied with the actual performances. Code is available on bitbucket: LockFreeExperiment

It's also possible to implement lock-free queue without hazard pointers using double-words CAS (but 64bit versions will only be possible on x86-64 using cmpxchg16b), I've a blog post about that (with untested code for the queue) here: Implementing generic double-word compare and swap for x86/x86-64 (LSE blog.)

My own benchmark show me that double-locked queue (also in Micheal&Scott 1996 paper) performs as well as the lock-free one (I haven't reach enough contention so that locked data structures have performances issues, but my bench are too light for now) and the concurrent queue from Intel's TBB seems even better (two time faster) for a relatively small number (depending on the operating system, under FreeBSD 9, the lowest bound I've found so far, this number is 8 threads on an i7 with 4 ht-core, and thus 8 logical CPUs) of threads and have very strange behavior (execution time of my simple benchmark move from seconds to hours !)

Another limitations about lock-free queues following the STL style: having iterators on lock-free queue has no sens.

3
votes

And then Intel Threading Building Blocks came. And for a time, it was good.

PS : you are looking for concurrent_queue and concurrent_hash_map

1
votes

To the best of my knowledge, there is no such thing publicly available yet. One issue an implementor needs to solve is that you need a lock-free memory allocator, which exists, though I cannot seem to find the link right now.

1
votes

The following is from Herb Sutter's article on Concurrent lock free Queue http://www.drdobbs.com/parallel/writing-a-generalized-concurrent-queue/211601363?pgno=1 . I have made some changes like compiler reordering stuff. One needs GCC v4.4+ to compile this code.

#include <atomic>
#include <iostream>
using namespace std;

//compile with g++ setting -std=c++0x

#define CACHE_LINE_SIZE 64

template <typename T>
struct LowLockQueue {
private:
    struct Node {
    Node( T* val ) : value(val), next(nullptr) { }
    T* value;
    atomic<Node*> next;
    char pad[CACHE_LINE_SIZE - sizeof(T*)- sizeof(atomic<Node*>)];
    };
    char pad0[CACHE_LINE_SIZE];

// for one consumer at a time
    Node* first;

    char pad1[CACHE_LINE_SIZE
          - sizeof(Node*)];

// shared among consumers
    atomic<bool> consumerLock;

    char pad2[CACHE_LINE_SIZE
          - sizeof(atomic<bool>)];

// for one producer at a time
    Node* last;

    char pad3[CACHE_LINE_SIZE
          - sizeof(Node*)];

// shared among producers
    atomic<bool> producerLock;

    char pad4[CACHE_LINE_SIZE
          - sizeof(atomic<bool>)];

public:
    LowLockQueue() {
    first = last = new Node( nullptr );
    producerLock = consumerLock = false;
    }
    ~LowLockQueue() {
    while( first != nullptr ) {      // release the list
        Node* tmp = first;
        first = tmp->next;
        delete tmp->value;       // no-op if null
        delete tmp;
    }
    }

    void Produce( const T& t ) {
    Node* tmp = new Node( new T(t) );
    asm volatile("" ::: "memory");                            // prevent compiler reordering
    while( producerLock.exchange(true) )
        { }   // acquire exclusivity
    last->next = tmp;         // publish to consumers
    last = tmp;             // swing last forward
    producerLock = false;       // release exclusivity
    }

    bool Consume( T& result ) {
    while( consumerLock.exchange(true) )
        { }    // acquire exclusivity
    Node* theFirst = first;
    Node* theNext = first-> next;
    if( theNext != nullptr ) {   // if queue is nonempty
        T* val = theNext->value;    // take it out
        asm volatile("" ::: "memory");                            // prevent compiler reordering
        theNext->value = nullptr;  // of the Node
        first = theNext;          // swing first forward
        consumerLock = false;             // release exclusivity
        result = *val;    // now copy it back
        delete val;       // clean up the value
        delete theFirst;      // and the old dummy
        return true;      // and report success
    }
    consumerLock = false;   // release exclusivity
    return false;                  // report queue was empty
    }
};

int main(int argc, char* argv[])
{
    //Instead of this Mambo Jambo one can use pthreads in Linux to test comprehensively
LowLockQueue<int> Q;
Q.Produce(2);
Q.Produce(6);

int a;
Q.Consume(a);
cout<< a << endl;
Q.Consume(a);
cout<< a << endl;

return 0;
}
0
votes

I wrote this at some point probably back in 2010, I'm sure with help from different references. It Multi-Producer Single Consumer.

template <typename T>
class MPSCLockFreeQueue 
{
private:
    struct Node 
    {
        Node( T val ) : value(val), next(NULL) { }
        T value;
        Node* next;
    };
    Node * Head;               
    __declspec(align(4)) Node * InsertionPoint;  //__declspec(align(4)) forces 32bit alignment this must be changed for 64bit when appropriate.

public:
    MPSCLockFreeQueue() 
    {
        InsertionPoint = new Node( T() );
        Head = InsertionPoint;
    }
    ~MPSCLockFreeQueue() 
    {
        // release the list
        T result;
        while( Consume(result) ) 
        {   
            //The list should be cleaned up before the destructor is called as there is no way to know whether or not to delete the value.
            //So we just do our best.
        }
    }

    void Produce( const T& t ) 
    {
        Node * node = new Node(t);
        Node * oldInsertionPoint = (Node *) InterLockedxChange((volatile void **)&InsertionPoint,node);
        oldInsertionPoint->next = node;
    }

    bool Consume( T& result ) 
    {
        if (Head->next)
        {
            Node * oldHead = Head;
            Head = Head->next;
            delete oldHead;
            result = Head->value;
            return true;
        }       
        return false;               // else report empty
    }

};