What is the fastest race free method for polling a lockless queue?

Question

Say we have a single-producer-thread single-consumer-thread lockless queue, and that the producer may go long periods without producing any data. It would be beneficial to let the consumer thread sleep when there is nothing in the queue (for the power savings and freeing up the CPU for other processes/threads). If the queue were not lockless, the straightforward way to solve this problem is to have the producing thread lock a mutex, do its work, signal a condition variable and unlock, and for the reading thread to lock the mutex, wait on the condition variable, do its reading, then unlock. But if we're using a lockless queue, using a mutex the exact same way would eliminate the performance we gain from using a lockless queue in the first place.

The naive solution is to have the producer after each insertion into the queue lock the mutex, signal the condition variable, then unlock the mutex, keeping the actual work (the insertion into the queue) completely outside the lock, and to have the consumer do the same, locking the mutex, waiting on the condition variable, unlocking it, pulling everything off the queue, then repeat, keeping the reading of the queue outside the lock. There's a race condition here though: between the reader pulling off the queue and going to sleep, the producer may have inserted an item into the queue. Now the reader will go to sleep, and may stay so indefinitely until the producer inserts another item and signals the condition variable again. This means you can occasionally end up with particular items seeming to take a very long time to travel through the queue. If your queue is always constantly active this may not be a problem, but if it were always active you could probably forget the condition variable entirely.

AFAICT the solution is for the producer to behave the same as if it were working with a regular needs-locking queue. It should lock the mutex, insert into the lockless queue, signal the condition variable, unlock. However, the consumer should behave differently. When it wakes, it should unlock the mutex immediately instead of waiting until it's read the queue. Then it should pull as much of the queue as it can and process it. Finally, only when the consumer is thinking of going to sleep, should it lock the mutex, check if there's any data, then if so unlock and process it or if not then wait on the condition variable. This way the mutex is contended less often than it would be with a lockfull queue, but there's no risk of going to sleep with data still left on the queue.

Is this the best way to do it? Are there alternatives?

Note: By 'fastest' I really mean 'fastest without dedicating a core to checking the queue over and over,' but that wouldn't fit in the title ;p

One alternative: Go with the naive solution, but have the consumer wait on the condition variable with a timeout corresponding to the maximum latency you are willing to tolerate for an item traveling through the queue. If the desired timeout is fairly short though, it may be below the minimum wait time for your OS or still consume too much CPU.

Can't you have the producer signal the condition variable every time it produces something? Why does it need a mutex? — Gabe
@Gabe: Two reasons. First, in this case, because the producer may produce something and fire the signal between when the consumer finishes processing an item and when it decides to wait on the condition variable. Then the consumer will go to sleep and the item the producer made will be stuck on the queue until the next time it fires the signal. Second, because at least in the pthreads API, you can't use condition variables without mutexes. You actually have to pass a mutex in to the wait function. I don't know whether all implementations of condition variables actually require them though. — Joseph Garvin
@Gabe: One misconception that might lead you to think that is thinking that if a signal is fired when nothing is waiting on the condition variable, that the next time something waits on the condition variable it will be woken up instantly, but that isn't the case. If you're not waiting on the condition variable when the signal fires, as far as you know it never happened. In this sense waiting on a condition variable is different than using poll/select on a file/socket/pipe. — Joseph Garvin
Sorry, I thought your question was more generic. You should edit the question to mention that you are using pthreads. — Gabe
@Gabe: It is meant to be generic, I just can't check every threading API for every OS. I'm reasonably sure though that almost all APIs require to use a mutex with the condition variable. It's also true for boost's threading API and win32's threads. I think that you hold the lock when you wake up from a condition variable is part of their definition, which means a mutex has to be involved. — Joseph Garvin

psmears psmears · Accepted Answer · 2010-12-09T21:41:16

I'm assuming you're using the lock-free single-producer single-consumer queue from the Dr Dobbs article - or something similar - so I'll use the terminology from there.

In that case, your suggested answer in the paragraph that starts "AFAICT" is good, but I think it can be optimised slightly:

In the consumer - as you say, when the consumer has exhausted the queue and is considering sleeping (and only then), it locks the mutex, checks the queue again, and then either
- releases the mutex and carries on working, if there was a new item in the queue
- or blocks on the condition variable (releasing the mutex when it awakes to find a non-empty queue, naturally).
In the producer:
- First take a copy of last, call it saved_last
- Add the item new_item as usual, then take a copy of the divider pointer, call it saved_divider.
- If the value of saved_divider is equal to new_item, the object you just inserted, then your object has already been consumed, and your work is done.
- Otherwise, if the value of saved_divider is not equal to saved_last, then you don't need to wake up the consumer. This is because:
  - At a time strictly after you added your new object, you know that divider had not yet reached either new_item or saved_last
  - Since you started the insertion, last has only had those two values
  - The consumer only ever stops when divider is equal to last
  - Therefore the consumer must still be awake and will reach your new item before sleeping.
- Otherwise lock the mutex, signal the condvar then release the mutex. (Obtaining the mutex here ensures you don't signal the condar in the time between the consumer noticing the queue is empty, and actually blocking on the condvar.)

This ensures that, in the case where the consumer tends to remain busy, you avoid locking the mutex when you know the consumer is still awake (and not about to sleep). It also minimises the time when the mutex is held, to further reduce the possibility for contention.

The above explanation is quite wordy (because I wanted to include the explanation of why it works, rather than just what the algorithm is), but the code resulting from it should be quite simple.

Of course whether it's actually worth doing will depend on a lot of things, and I'd encourage you to measure if performance is critical for you. Good implementations of the mutex/condvar primitives internally use atomic operations for most cases, so they generally only make a kernel call (the most expensive bit!) if there's a need to - i.e. if there's a need to block, or there are definitely some threads waiting - so the time saved by not calling the mutex functions may only amount to the overhead of the library calls.

What is the fastest race free method for polling a lockless queue?

1 Answers