Oh, the mind-bending horror of weak memory ordering...
The first snippet is your basic atomic read-modify-write - if someone else touches whatever address x1
points to, the store-exclusive will fail and it will try again until it succeeds. So far so good. However, this only applies to the address (or more rightly region) covered by the exclusive monitor, so whilst it's good for atomicity, it's ineffective for synchronisation of anything other than that value.
Consider a case where CPU1 is waiting for CPU0 to write some data to a buffer. CPU1 sits there waiting on some kind of synchronisation object (let's say a semaphore), waiting for CPU0 to update it to signal that new data is ready.
- CPU0 writes to the data address.
- CPU0 increments the semaphore (atomically, as you do) which happens to be elsewhere in memory.
- ???
- CPU1 sees the new semaphore value.
- CPU1 reads some data, which may or may not be the old data, the new data, or some mix of the two.
Now, what happened at step 3? Maybe it all occurred in order. Quite possibly, the hardware decided that since there was no address dependency it would let the store to the semaphore go ahead of the store to the data address. Maybe the semaphore store hit in the cache whereas the data didn't. Maybe it just did so because of complicated reasons only those hardware guys understand. Either way it's perfectly possible for CPU1 to see the semaphore update before the new data has hit memory, thus read back invalid data.
To fix this, CPU0 must have a barrier between steps 1 and 2, to ensure the data has definitely been written before the semaphore is written. Having the atomic write be a barrier is a nice simple way to do this. However since barriers are pretty performance-degrading you want the lightweight no-barrier version as well for situations where you don't need this kind of full synchronisation.
Now, the even less intuitive part is that CPU1 could also reorder its loads. Again since there is no address dependency, it would be free to speculate the data load before the semaphore load irrespective of CPU0's barrier. Thus CPU1 also needs its own barrier between steps 4 and 5.
For the more authoritative, but pretty heavy going, version have a read of ARM's Barrier Litmus Tests and Cookbook. Be warned, this stuff can be confusing ;)
As an aside, in this case the architectural semantics of acquire/release complicate things further. Since they are only one-way barriers, whilst OSAtomicAdd32Barrier
adds up to a full barrier relative to code before and after it, it doesn't actually guarantee any ordering relative to the atomic operation itself - see this discussion from Linux for more explanation. Of course, that's from the theoretical point of view of the architecture; in reality it's not inconceivable that the A7 hardware has taken the 'simple' option of wiring up LDAXR
to just do DMB+LDXR
, and so on, meaning they can get away with this since they're at liberty to code to their own implementation, rather than the specification.