4
votes

It's known that the x86 architecture doesn't implement a sequentially consistent memory model, because it uses write buffers: store->load reordering can take place, i.e. later loads can be committed while earlier stores still reside in the write buffer, waiting to be committed to the L1 cache.

In A Primer on Memory Consistency and Coherence we can read about read-modify-write (RMW) operations in the Total Store Order (TSO) memory consistency model (which is supposed to be very similar to x86):

... we consider the RMW as a load immediately followed by a store. The load part of the RMW cannot pass earlier loads due to TSO’s ordering rules. It might at first appear that the load part of the RMW could pass earlier stores in the write buffer, but this is not legal. If the load part of the RMW passes an earlier store, then the store part of the RMW would also have to pass the earlier store because the RMW is an atomic pair. But because stores are not allowed to pass each other in TSO, the load part of the RMW cannot pass an earlier store either.

OK, an atomic operation must be atomic, i.e. the memory location accessed by the RMW can't be accessed by other threads/cores during the RMW operation. But what if the earlier store that the load part of the atomic operation passes is unrelated to the memory location accessed by the RMW? Assume we have the following pair of instructions (in pseudocode):

store int32 value in 0x00000000 location
atomic increment int32 value in 0x10000000 location

The first store is added to the write buffer and waits its turn. Meanwhile, the atomic operation loads the value from another location (even from another cache line), passing the first store, and places its own store into the write buffer right after the first one. In global memory order we would see:

load (part of atomic) -> store (ordinal) -> store (part of atomic)

Yes, maybe it's not the best solution from a performance point of view, since we'd need to hold the cache line for the atomic operation in a read-write state until all preceding stores from the write buffer are committed. But, performance considerations aside, is there any violation of the TSO memory consistency model if we allow the load part of an RMW operation to pass earlier stores to unrelated locations?

2
If you are using an instruction pair (load-linked/store-conditional) to implement the atomic increment operation, I can't see anything wrong with your suggested order. However, if it's a single instruction, then it's not possible, as the load part of the atomic becomes a micro-op, and we would be trying to mix ops and micro-ops, which is probably not a good idea. – Isuru H
@IsuruH On x86 it's a single instruction. But what could be wrong with such mixing? The micro load op doesn't wait for the previous stores and takes the value from the cache, while the micro store op just places its result in the write buffer. – undermind
@IsuruH On x86 RMW operations are implemented with the lock prefix, which, among other things, can hold the cache line in the M state during the execution of the atomic instruction. Once the instruction retires, the lock is released. So, yes, placing the store part of the RMW operation in the write buffer could violate the atomicity of the operation, since between the time the store is buffered and the time it's written to the cache, any other core could access the old value. That in particular answers my question, though it is rather an implementation detail than a conceptual limitation of TSO. – undermind
Thanks!! Your comment and @Leeor's answer explain why this cannot be done. However, in my head it sounds like, technically, you could allow a store to a different cache line to be drained between the read and the write of the atomic operation. My knowledge of micro-ops is a bit limited, so I am not sure how you would reorder parts of an instruction; to me, reordering happens at the instruction level. – Isuru H
@IsuruH AFAIK, such "reordering" may happen even without the CPU actually reordering instructions. Even on a scalar CPU with a single pipeline and in-order commit, all you need is to satisfy loads immediately from the cache or the write buffer (if it contains recent stores to the needed location) while pushing stores into the write buffer, thus delaying them. In that case the global order of store->load memory operations changes even if they are micro-ops. – undermind

2 Answers

5
votes

You could ask the same question about any store + load pair to different addresses: the load may be executed internally before the older store due to out-of-order execution. On x86 this would be allowed, because:

Loads may be reordered with older stores to different locations but not with older stores to the same location

(source: Intel 64 Architecture Memory Ordering White Paper)

However, in your example, the lock prefix would prevent that, because (from the same set of rules):

Locked instructions have a total order

This means that the lock prefix effectively enforces a memory barrier, like an mfence (and indeed some compilers use a locked operation as a fence). This usually makes the CPU stall the load until the store buffer has drained, forcing the store to execute first.

0
votes

since we need to hold the cache line for the atomic operation in a read-write state until all preceding stores from the write buffer are committed, but, performance considerations aside

If you hold a lock L while performing operations S that are of the same nature as those prevented by L (that is, there exists an S' that can be blocked (delayed) by L, and S can be blocked (delayed) by some other L'), then you have a recipe for deadlock, unless you are guaranteed to be the only actor doing that (which would make the whole atomic thing pointless).