2 votes

I have read that putting a fence instruction after a load-modify-store instruction, like BTS, lets you treat that instruction as atomic. But according to Intel's documentation, the fence instructions are described as follows:

(MFENCE)

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction.

So, how does such behavior guarantee the mentioned "atomicity"?

Specifically, if we have two simultaneous runs of the following code performed by distinct processors, how would the fence prevent both from reading 0 into CF?

start memory assumption: [addr] contains the word 0

BTS WORD PTR [addr], 0
MFENCE
1
Can you post a link to what you quote? Fences enforce ordering only with regard to the same thread (they rely on program order). On a multi-threaded system this is not enough to achieve atomicity. - Leeor
I thought so. I've read about using them to achieve atomicity on some mailing lists. The posts were old and I do not think they came from really serious people, so maybe no one was taking multi-processor machines into consideration. - Clare

1 Answer

4 votes

Pushing some fences in is not enough to guarantee atomicity.

For single-threaded code there's no real benefit to them; the CPU orders the loads and stores internally to achieve correct execution as if the core ran serially (even though in reality, most modern CPUs run out of order).

The benefit of fences might come in scenarios such as this -

thread1:                    |         thread 2:
    store [x],1             |             store [y],1
    load [y] -> r1          |             load [x] -> r2

This is a classic example of memory consistency issues - the outcomes the programmer would expect when reading the two registers are 1,1 (both stores occurred first, then both loads), or 1,0 or 0,1 (if one of the threads ran ahead of the other). What you don't expect is 0,0, since at least one of the threads should have done its write first. However, with relaxed memory ordering this is possible - the loads are done early along the pipe, and the stores very late. Since there's no intra-thread aliasing between the addresses (assume x != y), there's nothing the CPU does to prevent that.
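
If it helps to see this outside of assembly, here is a minimal C++11 sketch of the same litmus test (the variable and function names are mine, not from the original). With relaxed atomics, nothing forbids the 0,0 outcome:

    #include <atomic>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_relaxed);   // store [x],1
        r1 = y.load(std::memory_order_relaxed);  // load [y] -> r1
    }

    void thread2() {
        y.store(1, std::memory_order_relaxed);   // store [y],1
        r2 = x.load(std::memory_order_relaxed);  // load [x] -> r2
    }

    int main() {
        std::thread t1(thread1), t2(thread2);
        t1.join(); t2.join();
        // Possible results: 1,1 / 1,0 / 0,1 - and, with relaxed ordering, 0,0 too.
    }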

Adding fences as below would guarantee that if one of the threads reached its load, the preceding store must have been dispatched and observed. This means that you can still get 0,1 and 1,0 (if one thread's entire store-fence-load sequence completes first), and of course 1,1, but you can't get 0,0 anymore.

thread1:                    |         thread 2:
    store [x],1             |             store [y],1
    mfence                  |             mfence
    load [y] -> r1          |             load [x] -> r2
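
A hedged C++11 sketch of the fenced version, using a seq_cst fence where the assembly has mfence (again, the names are mine):

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        x.store(1, std::memory_order_relaxed);                // store [x],1
        std::atomic_thread_fence(std::memory_order_seq_cst);  // mfence
        r1 = y.load(std::memory_order_relaxed);               // load [y] -> r1
    }
    // thread2 is symmetric (swap x and y, use r2); with both fences in
    // place, the 0,0 result can no longer occur.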

See also - http://bartoszmilewski.com/2008/11/05/who-ordered-memory-fences-on-an-x86/

However, you asked about atomicity, which is stronger - let's take your example:

BTS WORD PTR [addr], 0
MFENCE

If we replicate it onto 2 threads, it's essentially like before, except that the fence goes after the load and the store (the fact that they're grouped into the same instruction doesn't change the basic operations done). Nothing stops both threads from doing their reads first, reading 0 in both, and only then doing the stores (which would involve some MESI-state race in the caches, as both threads would compete for ownership if they're on different cores, but would ultimately result in both of them writing to that line). Then you can go perform the mfences all you want; that's not going to save you from the already broken atomicity.
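
As an illustration of that broken interleaving, here is the sequence modeled in C++11, with the load and store halves of BTS deliberately split into the separate basic operations described above (names are mine; this is a sketch of the failure mode, not how BTS is implemented anywhere):

    #include <atomic>

    std::atomic<unsigned> addr{0};

    // Mimics unlocked "BTS [addr],0 ; MFENCE" as separate basic operations.
    bool broken_bts0() {
        unsigned old = addr.load();                           // the load half of BTS
        addr.store(old | 1);                                  // the store half of BTS
        std::atomic_thread_fence(std::memory_order_seq_cst);  // the trailing MFENCE
        return old & 1;  // "CF" - two racing threads can both get 0 here
    }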

What would guarantee atomicity is a good old decent lock - on x86 that's the LOCK prefix (e.g. LOCK BTS), which makes the whole read-modify-write atomic. The threads won't be able to simultaneously share the line even for the reads that way. It's usually considered a slow but necessary evil, but some modern CPUs may even optimize it away in HW! See - http://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions
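
In C++11 terms, the locked version would be an atomic read-modify-write; a sketch (my names; on x86, compilers generally emit a LOCK-prefixed instruction for fetch_or):

    #include <atomic>

    std::atomic<unsigned> addr{0};

    bool locked_bts0() {
        // Atomic read-modify-write; on x86 this becomes a LOCK-prefixed
        // instruction, so only one of two racing threads sees 0.
        unsigned old = addr.fetch_or(1);
        return old & 1;
    }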

EDIT: After searching a bit, I believe what led to this question is related to how the atomic keyword is defined in C++11. These links - Concurrency: Atomic and volatile in C++11 memory model and http://bartoszmilewski.com/2008/12/01/c-atomics-and-memory-ordering/ - indicate that some of the implementations are done by pushing mfences after the store. However, I don't think this is meant to imply that any regular (non-library) operation done on an atomic variable is bound to be atomic. Anyway, this mechanism is supposed to provide multiple memory consistency models, so we'll need to be more specific here.
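
For reference, a hedged sketch of the mapping those links describe (the exact instruction choice is up to the compiler):

    #include <atomic>

    std::atomic<int> g{0};

    void publish() {
        // seq_cst store: x86 compilers commonly emit MOV + MFENCE, or a
        // single XCHG, to get the full fence described in those links.
        g.store(1, std::memory_order_seq_cst);

        // release store: on x86's strong memory model a plain MOV is
        // enough, so no fence instruction is emitted.
        g.store(1, std::memory_order_release);
    }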

EDIT2: There does seem to be a large "movement" (not sure what to call it :) trying to reduce the necessity of locks; here's an interesting piece: http://preshing.com/20120612/an-introduction-to-lock-free-programming/ . This is mostly about SW design and being able to distinguish the real potential data races, but the bottom line seems to be that some locks will always be required. The C++11 additions, while making life easier for a given consistency model and removing the need for the programmer to implement a HW-specific solution, may still be forced to fall back on the old solution. Quote: Be aware that the C++11 atomic standard does not guarantee that the implementation will be lock-free on every platform.
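
That caveat is easy to probe at runtime; a small sketch (the Big struct is mine, chosen only to be wider than any single atomic instruction):

    #include <atomic>
    #include <cstdio>

    struct Big { char bytes[64]; };  // too wide for one atomic instruction

    int main() {
        std::atomic<int> small_one{0};
        std::atomic<Big> big_one{};
        // The int is lock-free on mainstream platforms; the 64-byte struct
        // typically is not - the library falls back on an internal lock.
        std::printf("int: %d, Big: %d\n",
                    (int)small_one.is_lock_free(), (int)big_one.is_lock_free());
    }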