It's known that the x86 architecture doesn't implement a sequentially consistent memory model because it uses write buffers, so store->load reordering can take place (later loads can complete while earlier stores still sit in the write buffer, waiting to be committed to the L1 cache).
In A Primer on Memory Consistency and Coherence we can read about read-modify-write (RMW) operations in the Total Store Order (TSO) memory consistency model (which is supposed to be very similar to the x86 model):
... we consider the RMW as a load immediately followed by a store. The load part of the RMW cannot pass earlier loads due to TSO’s ordering rules. It might at first appear that the load part of the RMW could pass earlier stores in the write buffer, but this is not legal. If the load part of the RMW passes an earlier store, then the store part of the RMW would also have to pass the earlier store because the RMW is an atomic pair. But because stores are not allowed to pass each other in TSO, the load part of the RMW cannot pass an earlier store either.
OK, an atomic operation must be atomic, i.e. the memory location accessed by the RMW can't be accessed by other threads/cores during the RMW operation. But what if the earlier store that the load part of the atomic operation passes is unrelated to the memory location accessed by the RMW? Assume we have the following pair of instructions (in pseudocode):
store int32 value in 0x00000000 location
atomic increment int32 value in 0x10000000 location
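For concreteness, the two pseudocode instructions could be written in C++ roughly as follows. This is only a sketch: the variable and function names are hypothetical, the fixed addresses are replaced by two named variables placed on separate cache lines (assuming 64-byte lines), and relaxed ordering is used so that the C++ memory model adds no ordering of its own. On x86, compilers typically emit a plain mov for the store and a lock-prefixed instruction for the increment.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-ins for the two unrelated locations in the pseudocode.
// alignas(64) keeps them on different cache lines (assuming 64-byte lines).
alignas(64) std::atomic<std::int32_t> plain_location{0};
alignas(64) std::atomic<std::int32_t> counter_location{0};

// Performs the ordinary store followed by the atomic increment and
// returns the value the increment observed before adding 1.
std::int32_t store_then_increment(std::int32_t value) {
    // "store int32 value": an ordinary store that may sit in the write buffer.
    plain_location.store(value, std::memory_order_relaxed);
    // "atomic increment": a single read-modify-write on the other location.
    return counter_location.fetch_add(1, std::memory_order_relaxed);
}
```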
The first store is added to the write buffer and waits for its turn. Meanwhile, the atomic operation loads the value from another location (even in another cache line), passing the first store, and adds its own store to the write buffer right after the first one. In the global memory order we'll see the following:
load (part of atomic) -> store (ordinary) -> store (part of atomic)
Yes, maybe it's not the best solution from a performance point of view, since we need to hold the cache line for the atomic operation in a read-write state until all preceding stores in the write buffer are committed. But, performance considerations aside, are there any violations of the TSO memory consistency model if we allow the load part of an RMW operation to pass earlier stores to unrelated locations?
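The two constraints used in the quoted passage can be checked mechanically against a candidate memory order. The sketch below is a toy model, not a definition of TSO: the Event type and both function names are hypothetical, only one core is modeled, and "atomic pair" is read strictly as "the load and store halves are adjacent in the memory order", which is exactly the assumption the question is probing.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One event in a hypothetical global memory order (a single core is modeled).
struct Event {
    std::string name;
    bool is_store;      // true for stores, false for loads
    int program_index;  // position in the core's program order
    int rmw_id;         // -1 for ordinary accesses, else the RMW it belongs to
};

// Rule from the quote: stores are not allowed to pass each other.
bool stores_in_program_order(const std::vector<Event>& order) {
    int last = -1;
    for (const Event& e : order) {
        if (!e.is_store) continue;
        if (e.program_index < last) return false;
        last = e.program_index;
    }
    return true;
}

// Strict reading of "atomic pair" (an assumption): the store half of an
// RMW immediately follows its load half in the memory order.
bool rmw_halves_adjacent(const std::vector<Event>& order) {
    for (std::size_t i = 0; i < order.size(); ++i) {
        if (order[i].rmw_id < 0 || order[i].is_store) continue;
        // order[i] is a load half; its store half must come next.
        if (i + 1 >= order.size()) return false;
        const Event& next = order[i + 1];
        if (!next.is_store || next.rmw_id != order[i].rmw_id) return false;
    }
    return true;
}

// The order proposed in the question: RMW load, ordinary store, RMW store.
std::vector<Event> question_order() {
    return {{"load (part of atomic)", false, 1, 0},
            {"store (ordinary)", true, 0, -1},
            {"store (part of atomic)", true, 2, 0}};
}

// The alternative from the quote: the whole RMW passes the ordinary store.
std::vector<Event> rmw_passes_store_order() {
    return {{"load (part of atomic)", false, 1, 0},
            {"store (part of atomic)", true, 2, 0},
            {"store (ordinary)", true, 0, -1}};
}
```

In this toy model the order from the question keeps the stores in program order but breaks strict adjacency, while moving the whole RMW ahead of the ordinary store keeps adjacency but reorders the stores. Whether strict adjacency is really required when the intervening store targets an unrelated location is the open point of the question.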
lock prefix, which, among other things, can hold the cache line in the M state during the execution of the atomic instruction. Once the instruction is retired, the lock is released, so yes, placing the store part of the RMW operation in the write buffer can violate the atomicity of the operation, since from the time the store is placed there until it is written to the cache, any other core can access the old value. So it partially answers my question, though it is an implementation detail rather than a conceptual limitation of TSO. – undermind