8
votes

Let's assume that 2 cores are trying to write different values to the same RAM address (1 byte), at the same moment of time (plus-minus eta), and without using any interlocked instructions or memory barriers. What happens in this case and what value will be written to the main RAM? The first one wins? The last one wins? Undetermined behavior?

2
If it's only one byte, then somebody will win. It won't be undetermined in the sense of garbage that's neither thread wrote to it.Mysticial
The terms "first" and "last" have no meaning in an unsynchronized program. It cannot be observed, you'd only find out afterwards with no guarantee that the code is going to behave to same the second time. It has to be enforced, that requires sync. Unless you define "eta" as least as large as the OS' hard fault response time and scheduling latency. Which only have an upper bound on an RTOS. Nobody ever waits that long, so it is UB.Hans Passant
last one to complete a transaction will win, but the race is not visible, not a case of last processor to START the transaction will win, but the last transaction from any of the masters to be processed by the ram controller in question, will be the one visible from that point until another write transaction happens.old_timer
In the good old days and this doesnt mean there are designs right now with this problem as there are, if two transactions happened to hit "at the same time" (one comes in during the multi-clock cycle period where a transaction is completing) the latter would be discarded. Video flickering/blinkies on the early/original pc. if the video scan was reading from memory when software was trying to talk to that memory one would lose and that character/pixel would come up wrong for that horizontal scan.old_timer
as far as you are concerned it is undetermined...one will win, but it is not deterministic to you as to who will win on each instance.old_timer

2 Answers

14
votes

x86 (like every other mainstream SMP CPU architecture) has coherent data caches. It's impossible for two difference caches (e.g. L1D of 2 different cores) to hold conflicting data for the same cache line.

The hardware imposes an order (by some implementation-specific mechanism to break ties in case two requests for ownership arrive in the same clock cycle from different cores). In most modern x86 CPUs, the first store won't be written to RAM, because there's a shared write-back L3 cache to absorb coherency traffic without a round-trip to memory.

Loads that appear after both the stores in the global order will see the value stored by whichever store went second.


(I'm assuming we're talking about normal (not NT) stores to cacheable memory regions (WB, not USWC, UC, or even WT). The basic idea would be the same in either case, though; one store would go first, the next would step on it. The data from the first store could be observed temporarily if a load happened to get between them in the global order, but otherwise the data from the store that the hardware chose to do 2nd would be the long-term effect.

We're talking about a single byte, so the store can't be split across two cache lines, and thus every address is naturally aligned so everything in Why is integer assignment on a naturally aligned variable atomic on x86? applies.


Coherency is maintained by requiring a core to acquire exclusive access to that cache line before it can modify it (i.e. make a store globally visible by committing it from the store queue to L1D cache).

This "acquiring exclusive access" stuff is done using (a variant of) the MESI protocol. Any given line in a cache can be Modified (dirty), Exclusive (owned by not yet written), Shared (clean copy; other caches may also have copies so an RFO (Read / Request For Ownership) is required before write), or Invalid. MESIF (Intel) / MOESI (AMD) add extra states to optimize the protocol, but don't change the fundamental logic that only one core can change a line at any one time.

If we cared about ordering of multiple changes to two different lines, then memory ordering an memory barriers would come into play. But none of that matters for this question about "which store wins" when the stores execute or retire in the same clock cycle.

When a store executes, it goes into the store queue. It can commit to L1D and become globally visible at any time after it retires, but not before; unretired instructions are treated as speculative and thus their architectural effects must not be visible outside the CPU core. Speculative loads have no architectural effect, only microarchitectural1.

So if both stores become ready to commit at "the same time" (clocks are not necessarily synchronized between cores), one or the other will have its RFO succeed first and gain exclusive access, and make its store data globally visible. Then, soon after, the other core's RFO will succeed and update the cache line with its data, so its store comes second in the global store order observed by all other cores.

x86 has a total-store-order memory model where all cores observe the same order even for stores to different cache lines (except for always seeing their own stores in program order). Some weakly-ordered architectures like PowerPC would allow some cores to see a different total order from other cores, but this reordering can only happen between stores to different lines. There is always a single modification order for a single cache line. (Reordering of loads with respect to each other and other stores means that you have to be careful how you go about observing things on a weakly ordered ISA, but there is a single order of modification for a cache line, imposed by MESI).

Which one wins the race might depend on something as prosaic as the layout of the cores on the ring bus relative to which slice of shared L3 cache that line maps to. (Note the use of the word "race": this is the kind of race which "race condition" bugs describe. It's not always wrong to write code where two unsynchronized stores update the same location and you don't care which one wins, but it's rare.)

BTW, modern x86 CPUs have hardware arbitration for the case when multiple cores contend for atomic read-modify-write to the same cache line (and thus are holding onto it for multiple clock cycles to make lock add byte [rdi], 1 atomic), but regular loads/stores only need to own a cache line for a single cycle to execute a load or commit a store. I think the arbitration for locked instructions is a different thing from which core wins when multiple cores are trying to commit stores to the same cache line. Unless you use a pause instruction, cores assume that other cores aren't modifying the same cache line, and speculatively load early, and thus will suffer memory-ordering mis-speculation if it does happen. (What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)

IDK if anything similar happens when two threads are both just storing without loading, but probably not because stores aren't speculatively reordered and are decoupled from out-of-order execution by the store queue. Once a store instruction retires, the store is definitely going to happen, so OoO exec doesn't have to wait for it to actually commit. (And in fact it has to retirem from the OoO core before it can commit, because that's how the CPU knows it's non-speculative; i.e. that no earlier instruction faulted or was a mispredicted branch)


Footnotes:

  1. Spectre blurs that line by using a cache-timing attack to read microarchitectural state into the architectural state.
4
votes

They will wind up being sequenced, likely between the L1 caches. One write will come first and the other will come second. Whichever one comes second will be the result that subsequent reads will see.