4
votes

Short version:

I'm trying to replace std::atomic from C++11 used in a lock-free, single-producer, single-consumer queue implementation from here. How do I replace this with boost::atomic?

Long version:

I'm trying to get better performance out of our app by using worker threads. Each thread has its own task queue. Currently we have to synchronize with a lock before enqueuing or dequeuing each task.

Then I found Herb Sutter's article on lock-free queue. It seems like an ideal replacement. But the code uses std::atomic from C++11, which I couldn't introduce to the project at this time.

More googling led to some examples, such as this one for Linux (echelon's), and this one for Windows (TINESWARE's). Both use platform-specific constructs such as WinAPI's InterlockedExchangePointer and GCC's __sync_lock_test_and_set.

I only need to support Windows and Linux, so maybe I can get away with some #ifdefs. But I thought it might be nicer to use what boost::atomic provides. Boost.Atomic is not part of the official Boost library yet, so I downloaded the source from http://www.chaoticmind.net/~hcb/projects/boost.atomic/ and use the include files with my project.
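For illustration, this is roughly what the #ifdef approach would look like: one wrapper function hiding the two platform calls. The wrapper name atomic_exchange_ptr is my own invention, and I haven't tested the Windows branch:

```cpp
#include <cassert>

#if defined(_WIN32)
#include <windows.h>
// WinAPI branch: atomically store 'desired' into *target, return the old value.
static void* atomic_exchange_ptr(void* volatile* target, void* desired)
{
    return InterlockedExchangePointer(target, desired);
}
#else
// GCC/Clang builtin: atomic exchange with acquire semantics
// (a full exchange on mainstream targets such as x86).
static void* atomic_exchange_ptr(void* volatile* target, void* desired)
{
    return __sync_lock_test_and_set(target, desired);
}
#endif
```

But keeping these #ifdefs out of the queue code is exactly what I was hoping boost::atomic would do for me.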

This is what I get so far:

#pragma once

#include <boost/atomic.hpp>

template <typename T>
class LockFreeQueue
{
private:
    struct Node
    {
        Node(T val) : value(val), next(NULL) { }
        T value;
        Node* next;
    };
    Node* first; // for producer only
    boost::atomic<Node*> divider;  // shared
    boost::atomic<Node*> last; // shared

public:
    LockFreeQueue()
    {
        first = new Node(T());
        divider = first;
        last = first;
    }

    ~LockFreeQueue()
    {
        while(first != NULL) // release the list
        {
            Node* tmp = first;
            first = tmp->next;
            delete tmp;
        }
    }

    void Produce(const T& t)
    {
        last.load()->next = new Node(t); // add the new item
        last = last.load()->next;
        while(first != divider) // trim unused nodes
        {
            Node* tmp = first;
            first = first->next;
            delete tmp;
        }
    }

    bool Consume(T& result)
    {
        if(divider != last) // if queue is nonempty
        {
            result = divider.load()->next->value; // C: copy it back
            divider = divider.load()->next;
            return true;  // and report success
        }
        return false;  // else report empty
    }
};

Some modifications to note:

boost::atomic<Node*> divider;  // shared
boost::atomic<Node*> last; // shared

and

    last.load()->next = new Node(t); // add the new item
    last = last.load()->next;

and

        result = divider.load()->next->value; // C: copy it back
        divider = divider.load()->next;

Am I applying load() (and the implicit store()) from boost::atomic correctly here? Can we say this is equivalent to Sutter's original C++11 lock-free queue?

PS. I studied many of the threads on SO, but none seems to provide an example for boost::atomic & lock-free queue.

2
Before you start poking around the lock-free hornet's nest, consider giving larger payloads to your worker threads so that the time spent on mutex contention becomes less significant. For example, if you are processing a stream of data, gather the data into large chunks before passing it on to a worker thread. - Emile Cormier
FWIW, please note that the Lockfree library was accepted into Boost: lists.boost.org/boost-announce/2011/08/0331.php tim.klingt.org/boost_lockfree - Igor R.
@EmileCormier I'll work on this, thx. - Gant
@m3rLinEz Did you use your code with success? - subb
As Igor R. mentioned, the lock-free queue is already available in Boost. You can use it directly. - Kemin Zhou

2 Answers

1
votes

Have you tried Intel Threading Building Blocks' atomic<T>? Cross-platform and free.

Also...

Single producer/single consumer makes your problem much easier because your linearization point can be a single operator. It becomes easier still if you are prepared to accept a bounded queue.

A bounded queue also offers cache-performance advantages, because you can allocate a cache-aligned memory block up front to maximize your hits, e.g.:

#include <vector>
#include "tbb/atomic.h"
#include "tbb/cache_aligned_allocator.h"

template< typename T >
class SingleProducerSingleConsumerBoundedQueue {
    typedef std::vector<T, tbb::cache_aligned_allocator<T> > queue_type;

public:
    SingleProducerSingleConsumerBoundedQueue(size_t capacity) :
        queue(capacity) { // size the vector up front; reserve() alone would make operator[] undefined
        head = 0;
        tail = 0;
    }

    size_t capacity() const {
        return queue.size();
    }

    bool try_pop(T& result) {
        if(tail - head == 0)
            return false;
        result = queue[head % queue.size()];
        head.fetch_and_increment(); // linearization point
        return true;
    }

    bool try_push(const T& source) {
        if(tail - head == queue.size())
            return false;
        queue[tail % queue.size()] = source;
        tail.fetch_and_increment(); // linearization point
        return true;
    }

private:
    queue_type queue;
    tbb::atomic<size_t> head;
    tbb::atomic<size_t> tail;
};
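If TBB isn't an option either, the same structure can be sketched with C++11 std::atomic, whose increment is the linearization point just like fetch_and_increment above. This is my own adaptation, not tested under contention:

```cpp
#include <atomic>
#include <vector>
#include <cassert>
#include <cstddef>

// SPSC bounded queue: producer only advances tail, consumer only advances head.
template <typename T>
class BoundedQueue {
public:
    explicit BoundedQueue(size_t capacity) : queue(capacity), head(0), tail(0) {}

    bool try_pop(T& result) {
        if (tail.load() - head.load() == 0)
            return false; // empty
        result = queue[head.load() % queue.size()];
        ++head; // linearization point
        return true;
    }

    bool try_push(const T& source) {
        if (tail.load() - head.load() == queue.size())
            return false; // full
        queue[tail.load() % queue.size()] = source;
        ++tail; // linearization point
        return true;
    }

private:
    std::vector<T> queue;
    std::atomic<size_t> head, tail;
};
```

Usage is the same pattern as with the TBB version: try_push returns false when the queue is full, try_pop returns false when it is empty, and items come out in FIFO order.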
0
votes

Check out this boost.atomic ringbuffer example from the documentation:

#include <boost/atomic.hpp>

template <typename T, size_t Size>
class ringbuffer
{
public:
    ringbuffer() : head_(0), tail_(0) {}

    bool push(const T & value)
    {
        size_t head = head_.load(boost::memory_order_relaxed);
        size_t next_head = next(head);
        if (next_head == tail_.load(boost::memory_order_acquire))
            return false;
        ring_[head] = value;
        head_.store(next_head, boost::memory_order_release);
        return true;
    }

    bool pop(T & value)
    {
        size_t tail = tail_.load(boost::memory_order_relaxed);
        if (tail == head_.load(boost::memory_order_acquire))
            return false;
        value = ring_[tail];
        tail_.store(next(tail), boost::memory_order_release);
        return true;
    }

private:
    size_t next(size_t current)
    {
        return (current + 1) % Size;
    }

    T ring_[Size];
    boost::atomic<size_t> head_, tail_;
};

// How to use    
int main()
{
    ringbuffer<int, 32> r;

    // try to insert an element
    if (r.push(42)) { /* succeeded */ }
    else { /* buffer full */ }

    // try to retrieve an element
    int value;
    if (r.pop(value)) { /* succeeded */ }
    else { /* buffer empty */ }
}

The code's only limitation is that the buffer length has to be known at compile time (or at construction time, if you replace the array with a std::vector<T>). Allowing the buffer to grow and shrink after construction is not trivial, as far as I understand.
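For completeness, here is a sketch of that construction-time variant. I've used std::atomic, which mirrors the boost::atomic interface used above; swap in boost::atomic and boost::memory_order_* if you can't use C++11. Note that, as in the original, one slot is always kept empty to distinguish "full" from "empty":

```cpp
#include <atomic>
#include <vector>
#include <cassert>
#include <cstddef>

template <typename T>
class ringbuffer
{
public:
    // Allocate capacity + 1 slots so that the usable capacity
    // matches what the caller asked for despite the empty slot.
    explicit ringbuffer(size_t capacity)
        : ring_(capacity + 1), head_(0), tail_(0) {}

    bool push(const T& value)
    {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next_head = next(head);
        if (next_head == tail_.load(std::memory_order_acquire))
            return false; // full
        ring_[head] = value;
        head_.store(next_head, std::memory_order_release);
        return true;
    }

    bool pop(T& value)
    {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false; // empty
        value = ring_[tail];
        tail_.store(next(tail), std::memory_order_release);
        return true;
    }

private:
    size_t next(size_t current) const
    {
        return (current + 1) % ring_.size();
    }

    std::vector<T> ring_;
    std::atomic<size_t> head_, tail_;
};
```

The memory orderings are unchanged from the documentation example: each side reads its own index relaxed, reads the other side's index with acquire, and publishes its update with release.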