> The cache line in core 1's L1D cache is still in Shared state
This is the part of the scenario that violates MESI. The store can't commit until the RFO (read-for-ownership) sent by core 2 has completed, so core 1 has the line in Invalid state.
In your example it wouldn't really be an "intermediate" step, though. Without synchronization, there's no way to distinguish your impossible scenario from simply having the load by core 1 happen before the line was invalidated. i.e. core 1's load can appear before core 2's store in the global order.
Stores don't become globally visible until well after they execute locally (they have to retire and then the store queue can commit them to L1D), and x86's memory model allows StoreLoad reordering, so stores can be delayed (kept hidden in the private store queue) until after later loads by core 2 become globally visible. (See Jeff Preshing's Memory Barriers Are Like Source Control Operations for more background on memory reordering, and what StoreLoad reordering means).
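As a concrete illustration, here is a minimal litmus-test sketch (the names X, Y, r1, r2 are mine, not from the question): each thread's store can sit in its core's store queue past that core's own later load, so neither thread is guaranteed to see the other's store.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};  // both start at 0
int r1, r2;

void thread1() {
    X.store(1, std::memory_order_relaxed);   // may sit in the store queue...
    r1 = Y.load(std::memory_order_relaxed);  // ...while this later load executes
}

void thread2() {
    Y.store(1, std::memory_order_relaxed);
    r2 = X.load(std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // (r1,r2) can be (0,1), (1,0), (1,1), or, because of StoreLoad
    // reordering, (0,0): neither store was globally visible in time.
    std::printf("r1=%d r2=%d\n", r1, r2);
}
```

On x86 these relaxed operations compile to plain `mov` instructions, and the (0,0) outcome is observable in practice on real hardware.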
In MESI (and all variants like MESIF or MOESI), if one cache has a line in E or M state, no other cache can have a copy of that line. The state table in the MESI Wikipedia article makes this perfectly clear: if one cache has a line in E or M state, all the others have it in Invalid state.
It's never possible for two caches to both have valid copies of a line with differing data. This is what it means for caches to be coherent, and stopping that from happening is the whole point of the MESI protocol.
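To make that invariant concrete, here's a toy model (my own sketch, not any real protocol implementation) that checks which per-line state combinations MESI permits:

```cpp
#include <cassert>
#include <vector>

enum class Mesi { Modified, Exclusive, Shared, Invalid };

// True iff this combination of per-cache states for ONE line is allowed:
// an E or M copy must be the only valid copy anywhere.
bool coherent(const std::vector<Mesi>& states) {
    int exclusiveOwners = 0, validCopies = 0;
    for (Mesi s : states) {
        if (s == Mesi::Modified || s == Mesi::Exclusive) ++exclusiveOwners;
        if (s != Mesi::Invalid) ++validCopies;
    }
    return exclusiveOwners == 0 || (exclusiveOwners == 1 && validCopies == 1);
}

int main() {
    assert(coherent({Mesi::Shared, Mesi::Shared}));     // many S copies: fine
    assert(coherent({Mesi::Modified, Mesi::Invalid}));  // M plus all-Invalid: fine
    assert(!coherent({Mesi::Modified, Mesi::Shared}));  // the impossible combo
}
```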
If a core wants to modify a cache line, it takes Exclusive ownership of the line so no other cores can observe stale values. This has to happen before a store can commit into L1D. Store queues exist to hide the latency of the RFO (among other things), but data in the store queue is not yet committed to L1D. (Related: What happens when different CPU cores write to the same RAM address without synchronization? has more about the store queue.)
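A rough sketch of that commit rule, with all names invented for illustration: the store enters the queue as soon as it retires, but only becomes globally visible once the RFO has brought the line to Exclusive/Modified state.

```cpp
#include <cstdint>
#include <deque>

enum class State { Modified, Exclusive, Shared, Invalid };

struct PendingStore { std::uint64_t addr, data; };

struct Core {
    std::deque<PendingStore> storeQueue;  // retired stores, not yet visible
    State lineState = State::Shared;      // the line this example stores to

    // Stand-in for the bus transaction: invalidates other copies (not modeled).
    void issueRfo() { lineState = State::Exclusive; }

    // Commit the oldest store only once this core owns the line in E or M;
    // this is the moment the store becomes globally visible.
    void tryCommit() {
        if (storeQueue.empty()) return;
        if (lineState != State::Exclusive && lineState != State::Modified)
            issueRfo();                   // latency hidden behind the queue
        lineState = State::Modified;
        storeQueue.pop_front();
    }
};

int main() {
    Core c;
    c.storeQueue.push_back({0x1000, 42});  // store retires; still core-private
    c.tryCommit();                         // RFO completes, then the commit
}
```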
And BTW, let's assume that [mem] is naturally aligned, so loads/stores to it are atomic (as guaranteed by the x86 architecture; see Why is integer assignment on a naturally aligned variable atomic on x86?).
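For instance (a minimal sketch; the variable name mem is just to mirror the text), a naturally-aligned 32-bit std::atomic is lock-free on x86 precisely because a plain aligned mov is already atomic:

```cpp
#include <atomic>
#include <cstdint>
#include <cstdio>

// Naturally aligned 32-bit variable: x86 guarantees plain loads/stores of it
// are atomic, so std::atomic<int32_t> needs no lock (just an ordinary mov).
std::atomic<std::int32_t> mem{0};

int main() {
    mem.store(1, std::memory_order_relaxed);  // a single mov on x86
    std::printf("lock_free=%d value=%d\n",
                (int)mem.is_lock_free(),
                (int)mem.load(std::memory_order_relaxed));
}
```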
Multi-level caches and Modified lines
With multi-level caches, dirty cache lines can propagate up the hierarchy. So a line can be in Modified state in L1D and L2 of the same core. This is fine because write-back from L1D goes through L2.
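Here's a toy two-level model (structure and names are my assumptions, for illustration only) of that write-back path: dirty data from L1D lands in L2 rather than going straight to DRAM, so both private levels can hold the line dirty at once.

```cpp
#include <cstdint>
#include <unordered_map>

enum class State { Modified, Exclusive, Shared, Invalid };
struct Line { State state = State::Invalid; std::uint64_t data = 0; };

struct CacheLevel { std::unordered_map<std::uint64_t, Line> lines; };

struct Core {
    CacheLevel l1d, l2;  // the private hierarchy of one core

    // Inclusive fill: a miss brings the line into both L2 and L1D.
    void fill(std::uint64_t addr) {
        l2.lines[addr]  = {State::Exclusive, 0};
        l1d.lines[addr] = {State::Exclusive, 0};
    }

    void store(std::uint64_t addr, std::uint64_t data) {
        l1d.lines[addr] = {State::Modified, data};  // dirty in L1D
    }

    // Write-back from L1D goes through L2: afterwards BOTH levels hold the
    // line dirty relative to DRAM, which is fine within one core's hierarchy.
    void writeBackFromL1d(std::uint64_t addr) {
        l2.lines[addr] = {State::Modified, l1d.lines[addr].data};
    }
};

int main() {
    Core c;
    c.fill(0x40);
    c.store(0x40, 123);        // Modified in L1D
    c.writeBackFromL1d(0x40);  // now Modified in L2 as well
}
```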
As I understand it, the shared inclusive L3 cache in Intel CPUs doesn't have to write-back to DRAM before it can share out copies of the cache line to multiple cores. So as far as normal / simple MESI is concerned, think of L3 as the backing store, not DRAM.
Making this work on multi-socket systems is tricky; I'm not sure whether things are set up so the L3 in a socket can only cache physical addresses that correspond to DRAM attached to that socket. In any case, snoop requests are sent between sockets on L3 cache miss, and there are lots of complicated settings you can configure to tweak this on a Xeon system. (See an AnandTech article about Haswell Xeon, for example.)