11 votes

I'm writing a program with a consumer thread and a producer thread, and queue synchronization now seems to be a big overhead in the program. I looked for lock-free queue implementations, but only found Lamport's version and an improved version from PPoPP '08:

#include <errno.h>   /* for EWOULDBLOCK */

#define QUEUE_SIZE 1024
#define NEXT(i) (((i) + 1) % QUEUE_SIZE)

void *buffer[QUEUE_SIZE];  /* slots hold NULL when empty */
int head = 0;              /* written only by the producer */
int tail = 0;              /* written only by the consumer */

int enqueue_nonblock(void *data) {
    if (NULL != buffer[head]) {
        return EWOULDBLOCK;    /* queue full */
    }
    buffer[head] = data;
    head = NEXT(head);
    return 0;
}

int dequeue_nonblock(void **data) {
    *data = buffer[tail];
    if (NULL == *data) {
        return EWOULDBLOCK;    /* queue empty */
    }
    buffer[tail] = NULL;
    tail = NEXT(tail);
    return 0;
}

Both versions require a pre-allocated array for the data. My question is: is there any single-producer, single-consumer lock-free queue implementation that uses malloc() to allocate space dynamically?

And a related question: how can I measure the exact overhead of queue synchronization, e.g. how much time a pthread_mutex_lock() call takes?


9 Answers

7 votes

If you are worried about performance, adding malloc() to the mix won't help things. And if you are not worried about performance, why not simply control access to the queue via a mutex? Have you actually measured the performance of such an implementation? It sounds to me as though you are going down the familiar route of premature optimisation.

4 votes

The algorithm you show manages to work because although the two threads share the resource (i.e., the queue), they share it in a very particular way. Because only one thread ever alters the head index of the queue (the producer), and only one thread ever alters the tail index (the consumer, of course), you can't get an inconsistent state of the shared object. It's also important that the producer stores the actual data before updating the head index, and that the consumer reads the data it wants before updating the tail index.

It works as well as it does because the array is quite static; both threads can count on the storage for the elements being there. You probably can't replace the array entirely, but what you can do is change what the array is used for.

That is, instead of keeping the data in the array, use it to keep pointers to the data. Then you can malloc() and free() the data items, while passing references (pointers) to them between your threads via the array.
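
As a minimal sketch of that scheme, assuming the enqueue_nonblock()/dequeue_nonblock() ring buffer from the question with void * slots (struct item and consume() are hypothetical placeholders):

/* Producer: heap-allocate an item and hand its pointer to the consumer. */
struct item *it = malloc(sizeof *it);        /* struct item is hypothetical */
it->value = 42;
while (enqueue_nonblock(it) == EWOULDBLOCK)
    ;                                        /* queue full: spin, yield, or back off */

/* Consumer: take the pointer out, use the data, then free it. */
void *p;
while (dequeue_nonblock(&p) == EWOULDBLOCK)
    ;                                        /* queue empty */
struct item *received = p;
consume(received->value);
free(received);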

Also, POSIX does support reading a nanosecond-resolution clock via clock_gettime(), although the actual precision is system dependent. You can read this high-resolution clock before and after the operation and just subtract.
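
For example, a rough sketch that times uncontended pthread_mutex_lock()/pthread_mutex_unlock() pairs with clock_gettime() (a contended lock costs considerably more, so treat this as a lower bound):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

int main(void) {
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    struct timespec t0, t1;
    enum { N = 1000000 };

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&m);
        pthread_mutex_unlock(&m);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg uncontended lock+unlock: %.1f ns\n", ns / N);
    return 0;
}

Compile with -lpthread (and -lrt on older systems).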

3 votes

Yes.

There exist a number of lock-free multiple-reader multiple-writer queues.

I have implemented one, by Michael and Scott, from their 1996 paper.

I will (after some more testing) be releasing a small library of lock-free data structures (in C) which will include this queue.

3 votes

You should look at the FastFlow library.

2 votes

I recall seeing one that looked interesting a few years ago, though I can't seem to find it now. :( The lock-free implementation that was proposed did require a CAS primitive, though even the locking implementation (if you didn't want to use the CAS primitive) had pretty good performance characteristics: the locks only prevented multiple readers or multiple producers from hitting the queue at the same time, and the producer still never raced with the consumer.

I do remember that the fundamental concept behind the queue was to create a linked list that always had one extra "empty" node in it. This extra node meant that the head and the tail pointers of the list would only ever refer to the same node when the list was empty. I wish I could find the paper; I'm not doing the algorithm justice with my explanation...

AH-ha!

I've found someone who transcribed the algorithm without the remainder of the article. This could be a useful starting point.
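
Until then, here is my own minimal single-producer/single-consumer reconstruction of the dummy-node idea (not the transcribed algorithm; a real lock-free version also needs atomic operations or memory barriers around the next pointers):

#include <stdlib.h>

/* Linked queue with one permanent dummy node: head == tail exactly
   when the queue is empty, so producer and consumer never touch the
   same node. The consumer pops at head; the producer pushes at tail. */
struct node  { struct node *next; void *data; };
struct queue { struct node *head, *tail; };

void queue_init(struct queue *q) {
    struct node *dummy = calloc(1, sizeof *dummy);
    q->head = q->tail = dummy;
}

void enqueue(struct queue *q, void *data) {     /* producer only */
    struct node *n = calloc(1, sizeof *n);
    n->data = data;
    q->tail->next = n;    /* publish only after n is fully initialized */
    q->tail = n;
}

void *dequeue(struct queue *q) {                /* consumer only */
    struct node *first = q->head->next;
    if (first == NULL)
        return NULL;                            /* empty */
    void *data = first->data;
    free(q->head);                              /* retire the old dummy */
    q->head = first;                            /* first becomes the new dummy */
    return data;
}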

2 votes

I've worked with a fairly simple queue implementation that meets most of your criteria. It used a statically sized pool of bytes, and then we implemented messages within that. There was a head pointer that one process would move, and a tail pointer that the other process would move.

Locks were still required, but we used Peterson's 2-processor algorithm, which is pretty lightweight since it doesn't involve system calls. The lock is only required for a very small, well-bounded region: a few CPU cycles at most, so you never block for long.
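
For reference, here is a sketch of Peterson's algorithm using C11 atomics; the sequentially consistent ordering (the default for _Atomic accesses) matters, because with plain loads and stores the algorithm breaks on modern reordering hardware:

#include <stdatomic.h>

static _Atomic int flag[2];   /* flag[i]: thread i wants to enter */
static _Atomic int turn;      /* whose turn it is to wait */

void peterson_lock(int self) {        /* self is 0 or 1 */
    int other = 1 - self;
    atomic_store(&flag[self], 1);     /* announce intent */
    atomic_store(&turn, other);       /* politely yield priority */
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;                             /* spin while the other thread is inside */
}

void peterson_unlock(int self) {
    atomic_store(&flag[self], 0);
}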

2 votes

I think the allocator can be a performance problem. You can try to use a custom multithreaded memory allocator that uses a linked list for maintaining freed blocks. If your blocks are not (nearly) the same size, you can implement a "buddy system" memory allocator, which is very fast. You have to synchronise your queue (ring buffer) with a mutex.
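
A minimal fixed-size free-list sketch of that idea (my illustration; it is not thread-safe on its own, so guard it with a mutex or keep one pool per thread):

#include <stdlib.h>

/* Freed blocks are chained through their own storage, so block_size
   must be at least sizeof(struct free_node). */
struct free_node { struct free_node *next; };

struct pool {
    size_t block_size;
    struct free_node *free_list;
};

void *pool_alloc(struct pool *p) {
    if (p->free_list) {
        struct free_node *n = p->free_list;
        p->free_list = n->next;         /* pop a recycled block */
        return n;
    }
    return malloc(p->block_size);       /* fall back to malloc */
}

void pool_free(struct pool *p, void *block) {
    struct free_node *n = block;
    n->next = p->free_list;             /* push onto the free list */
    p->free_list = n;
}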

To avoid too much synchronisation, you can try to write/read multiple values to/from the queue on each access.
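
For example, the consumer can drain a batch per lock acquisition (q, q_lock, queue_pop() and process() are hypothetical names):

/* Amortize the mutex cost over up to BATCH items. */
#define BATCH 32
void *items[BATCH];
int n = 0;

pthread_mutex_lock(&q_lock);
while (n < BATCH && queue_pop(&q, &items[n]) == 0)
    n++;                                /* pop while the queue is non-empty */
pthread_mutex_unlock(&q_lock);

for (int i = 0; i < n; i++)
    process(items[i]);                  /* handle items outside the lock */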

If you still want to use lock-free algorithms, then you must use pre-allocated data or use a lock-free allocator. There is a paper about a lock-free allocator, "Scalable Lock-Free Dynamic Memory Allocation", and an implementation, Streamflow.

Before starting with lock-free stuff, look at: Circular lock-free buffer.

2 votes

Adding malloc() would kill any performance gain you might make, and a lock-based structure would be just as effective. This is because malloc() requires some sort of lock (or CAS loop) over the heap internally; some implementations of malloc have their own lock, so you may just end up locking inside the memory manager.

To use malloc() you would need to pre-allocate all the nodes and manage them with another queue...

Note that you can make some form of expandable array, which would need to lock when it is expanded.

Also, while interlocked (atomic) operations are lock-free at the instruction level, they do lock memory for the duration of the instruction and often stall the pipeline.

0 votes

This implementation uses C++'s new and delete, which can trivially be ported to the C standard library using malloc() and free():

http://www.drdobbs.com/parallel/writing-lock-free-code-a-corrected-queue/210604448?pgno=2