Memory Barriers in Single Producer Single Consumer Queue

Question

I've spent the last few weeks doing a lot of reading on memory models, compiler reordering, CPU reordering, memory barriers, and lock-free programming and I think I've driven myself into confusion now. I've written a single producer single consumer queue and am trying to figure out where I need memory barriers for things to work and if some operations need to be atomic. My single producer single consumer queue is as follows:

typedef struct queue_node_t {
    int data;
    struct queue_node_t *next;
} queue_node_t;

// Empty Queue looks like this:
// HEAD TAIL
//   |    |
// dummy_node

// Queue: insert at TAIL, remove from HEAD
//    HEAD           TAIL
//     |              |
// dummy_node -> 1 -> 2 -> NULL

typedef struct queue_t {
    queue_node_t *head; // consumer consumes from head
    queue_node_t *tail; // producer adds at tail
} queue_t;

queue_node_t *alloc_node(int data) {
    queue_node_t *new_node = (queue_node_t *)malloc(sizeof(queue_node_t));
    new_node->data = data;
    new_node->next = NULL;
    return new_node;
}

queue_t *create_queue() {
    queue_t *new_queue = (queue_t *)malloc(sizeof(queue_t));
    queue_node_t *dummy_node = alloc_node(0);
    dummy_node->next = NULL;
    new_queue->head = dummy_node;
    new_queue->tail = dummy_node;
    // 1. Do we need any kind of barrier to make sure that if the
    // thread that didn't call this performs a queue operation
    // and happens to run on a different CPU that queue structure
    // is fully observed by it? i.e. the head and tail are properly
    // initialized
    return new_queue;
}

// Enqueue modifies tail
void enqueue(queue_t *the_queue, int data) {
    queue_node_t *new_node = alloc_node(data);
    // insert at tail
    new_node->next = NULL;

    // Let us save off the existing tail
    queue_node_t *old_tail = the_queue->tail;

    // Make the new node the new tail
    the_queue->tail = new_node;

    // 2. Store/Store barrier needed here?

    // Link in the new node last so that a concurrent dequeue doesn't see
    // the node until we're done with it
    // I don't know that this needs to be atomic but it does need to have
    // release semantics so that this isn't visible until prior writes are done
    old_tail->next = the_queue->tail;
    return;
}

// Dequeue modifies head
bool dequeue(queue_t *the_queue, int *item) {
    // 3. Do I need any barrier here to make sure if an enqueue already happened
    // I can observe it? i.e., if an enqueue was called on 
    // an empty queue by thread 0 on CPU0 and dequeue is called
    // by thread 1 on CPU1
    // dequeue the oldest item (FIFO) which will be at the head
    if (the_queue->head->next == NULL) {
        return false;
    }
    *item = the_queue->head->next->data;
    queue_node_t *old_head = the_queue->head;
    the_queue->head = the_queue->head->next;
    free(old_head);
    return true;
}

Here are my questions corresponding to the comments in my code above:

In create_queue() do I need some kind of barrier before I return? I'm wondering if I call this function from thread 0 running on CPU0 and then use the pointer returned in thread 1 which happens to run on CPU1 is it possible thread 1 sees a queue_t structure that isn't fully initialized?
Do I need a barrier in enqueue() to make sure the new node isn't linked in to the queue until all of the new node's fields are initialized?
Do I need a barrier in dequeue()? I feel like it would be correct without one but I may need one if I want to make sure I see any completed enqueue.

Update: I tried to make it clear with the comments in the code but the HEAD of this queue always points to a dummy node. This is a common technique that makes it so that the producer only ever needs to access the TAIL and the consumer only ever accesses the HEAD. An empty queue will contain a dummy node and dequeue() always returns the node after the HEAD, if there is one. As nodes are dequeued the dummy node advances and the previous "dummy" is freed.

Domso Domso · Accepted Answer · 2016-10-26T13:50:33

first of all, it depends your specific hardware-architecture, OS, language, etc.

1.) no. because you need a additional barrier to pass the pointer to the other thread anyway

2.) yes, old_tail->next = the_queue->tail needs to be executed after the_queue->tail = new_node

3.) It would have no effect, since there is nothing before the barrier, but theoretically you could need a barrier after old_tail->next = the_queue->tail in enqueue(). The compiler wont reorder outside of a function, but the CPU could maybe do something like that. (very unlikely, but not 100% sure)

OT: since you are already doing some micro-optimization, you could add some padding for the cache

typedef struct queue_t {
    queue_node_t *head; // consumer consumes from head
    char cache_pad[64]; // head and tail shouldnt be in the same cache-line(->64 Byte)
    queue_node_t *tail; // producer adds at tail
} queue_t;

and if you have really enough memory to waste, you could do something like this

typedef struct queue_node_t {
    int data;
    struct queue_node_t *next;
    char cache_pad[56]; // sizeof(queue_node_t) == 64; only for 32Bit
} queue_node_t;

Memory Barriers in Single Producer Single Consumer Queue

1 Answers