Atomic memory ordering performance difference

Question

I wrote a small test to compare the performance of atomic loads under different memory orderings, and I found that relaxed and sequentially consistent loads perform the same. Is this just a sub-optimal compiler implementation, or is it the result I should expect in general on x86 processors? I am using gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-3). I compiled the test with -O2, which is why the second test, with a plain variable, shows zero execution time.

Results:
Start volatile tests with 1000000000 iterations
volatile test took 689438 microseconds. Last value of local var is 1
Start simple var tests with 1000000000 iterations
simple var test took 0 microseconds. Last value of local var is 2
Start relaxed atomic tests with 1000000000 iterations
relaxed atomic test took 25655002 microseconds. Last value of local var is 3
Start sequentially consistent atomic tests with 1000000000 iterations
sequentially consistent atomic test took 24844000 microseconds. Last value of local var is 4

These are the test functions:

#include <atomic>
#include <chrono>
#include <iostream>

using namespace std;
using std::chrono::duration_cast;
using std::chrono::microseconds;

std::atomic<int> atomic_var;

void relaxed_atomic_test(const unsigned iterations)
{
    cout << "Start relaxed atomic tests with " << iterations << " iterations" << endl;
    // duration_cast keeps this compiling when system_clock ticks finer than microseconds.
    const microseconds start = duration_cast<microseconds>(
        std::chrono::system_clock::now().time_since_epoch());
    int local_var = 0;
    for(unsigned counter = 0; iterations != counter; ++counter)
    {
        local_var = atomic_var.load(memory_order_relaxed);
    }
    const microseconds end = duration_cast<microseconds>(
        std::chrono::system_clock::now().time_since_epoch());
    cout << "relaxed atomic test took " << (end - start).count()
         << " microseconds. Last value of local var is " << local_var << endl;
}

void sequentially_consistent_atomic_test(const unsigned iterations)
{
    cout << "Start sequentially consistent atomic tests with "
         << iterations << " iterations" << endl;
    const microseconds start = duration_cast<microseconds>(
        std::chrono::system_clock::now().time_since_epoch());
    int local_var = 0;
    for(unsigned counter = 0; iterations != counter; ++counter)
    {
        local_var = atomic_var.load(memory_order_seq_cst);
    }
    const microseconds end = duration_cast<microseconds>(
        std::chrono::system_clock::now().time_since_epoch());
    cout << "sequentially consistent atomic test took " << (end - start).count()
         << " microseconds. Last value of local var is " << local_var << endl;
}

UPDATE: I ran the same tests with writes to the atomic variable instead of reads. The results are quite different: a write with memory_order_relaxed took the same time as a write to a volatile:

Start volatile tests with 1000000000 iterations
volatile test took 764088 microseconds. Last volatile_var value 999999999
Start simple var tests with 1000000000 iterations
simple var test took 0 microseconds. Last var value999999999
Start relaxed atomic tests with 1000000000 iterations
relaxed atomic test took 763968 microseconds. Last atomic_var value 999999999
Start sequentially consistent atomic tests with 1000000000 iterations
sequentially consistent atomic test took 15287267 microseconds. Last atomic_var value 999999999

So I conclude that, in a single thread with this processor and compiler, an atomic with relaxed memory ordering behaves like a volatile for stores and like an atomic with sequentially consistent memory ordering for loads.
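
The store-side test from the update is described but not shown. A minimal sketch of what such a loop might look like follows. The names `store_target` and `time_atomic_stores` are hypothetical, and `steady_clock` is swapped in for the `system_clock` used in the load tests because it is better suited to measuring intervals. This is not the code that produced the numbers above.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <limits>

// Hypothetical name; stands in for the atomic variable written by the test.
std::atomic<int> store_target{0};

// Stores 0 .. iterations-1 using the ordering given as a template argument
// and returns the elapsed wall time in microseconds.
template <std::memory_order Order>
long long time_atomic_stores(const unsigned iterations)
{
    // The counter is stored as an int, so it must fit in one.
    assert(iterations <= static_cast<unsigned>(std::numeric_limits<int>::max()));

    const auto start = std::chrono::steady_clock::now();
    for (unsigned counter = 0; iterations != counter; ++counter)
    {
        store_target.store(static_cast<int>(counter), Order);
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
}
```

Calling `time_atomic_stores<std::memory_order_relaxed>(1000000000)` and then the `std::memory_order_seq_cst` variant would mirror the two atomic rows above. The ordering is a template argument rather than a runtime parameter so that each instantiation sees it as a compile-time constant.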

Matthew G. · Accepted Answer · 2014-01-23T04:07:31

x86 is a relatively strict architecture with respect to memory consistency, so you're likely to see similar performance between the two orderings. You'd see a bigger difference on an architecture that allows more reordering, such as POWER.
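
To make this concrete, here is a small sketch of the four operations being compared. The names are hypothetical, and the comments describe what mainstream compilers typically emit for x86-64, not what any particular compiler is guaranteed to produce. How an older toolchain such as the GCC 4.4.7 in the question lowers these calls may differ.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical variable used only for illustration.
std::atomic<int> shared_value{0};

// On x86-64 both loads are typically a plain `mov`: the hardware does not
// reorder loads with other loads, so seq_cst needs no extra instruction here.
int load_relaxed() { return shared_value.load(std::memory_order_relaxed); }
int load_seq_cst() { return shared_value.load(std::memory_order_seq_cst); }

// A relaxed store is typically a plain `mov` as well.
void store_relaxed(int v) { shared_value.store(v, std::memory_order_relaxed); }

// A seq_cst store typically becomes `xchg` (or `mov` plus `mfence`), because
// x86 may otherwise let a later load pass an earlier store.
void store_seq_cst(int v) { shared_value.store(v, std::memory_order_seq_cst); }

// Single-threaded sanity check: every ordering observes the same values.
void check_orderings_agree()
{
    store_relaxed(1);
    assert(load_relaxed() == 1);
    assert(load_seq_cst() == 1);
    store_seq_cst(2);
    assert(load_relaxed() == 2);
    assert(load_seq_cst() == 2);
}
```

Compiling this with something like `g++ -std=c++11 -O2 -S` and reading the generated assembly is one way to check what a given compiler does. On a weaker architecture such as POWER, the loads would be expected to differ as well as the stores.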
