4
votes

I'd like to know more about the details of MESI on Intel Broadwell.

Suppose a CPU socket has 6 cores, core 0 to core 5. Each core has its own L1$ and L2$, and all of them share the L3$. There is a variable x in shared memory, located in a cache line I'll call XCacheL. The details of my question follow:

T1: Core 0, core 4 and core 5 each have x = 100, and XCacheL is in the Shared state since three cores hold a copy of it.

T2: Core 0 needs to modify x, so it broadcasts an invalidate signal. Core 4 and core 5 receive the signal and invalidate their copies of XCacheL. Core 0 then sets x to 200, and its XCacheL is now in the Modified state.

T3: Core 4 needs to read x, but its copy of XCacheL was invalidated in T2, so it raises a read miss and the following happens:

● Processor makes bus request to memory
● Snooping cache puts copy value on the bus
● Memory access is abandoned
● Local processor caches value
● Local copy tagged S
● Source (M) value copied back to memory
● Source value M -> S

So after T3, XCacheL is Shared in core 0 and core 4 and Invalid in core 5, and both the L3$ and main memory hold the newest valid XCacheL.
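The T1 to T3 sequence can be sketched as a toy model. This is only the generic textbook snooping MESI protocol for a single line with private caches and a backing memory; it has no L3 and makes no claim about how Broadwell actually implements coherence. `ToyLine` and its methods are made-up names for illustration.

```python
# Toy textbook MESI model for ONE cache line (hypothetical names, no L3,
# not Broadwell's real protocol).
M, E, S, I = "M", "E", "S", "I"

class ToyLine:
    def __init__(self, n_cores, memory_value):
        self.state = [I] * n_cores      # per-core MESI state of the line
        self.value = [None] * n_cores   # per-core cached copy of x
        self.memory = memory_value      # backing store

    def read(self, core):
        if self.state[core] != I:       # hit in the private cache
            return self.value[core]
        data = self.memory
        others = False
        for c, st in enumerate(self.state):
            if c == core or st == I:
                continue
            others = True
            if st == M:                 # owner supplies the data and writes it back
                self.memory = self.value[c]
                data = self.value[c]
            self.state[c] = S           # M/E/S -> S on a remote read
        self.state[core] = S if others else E
        self.value[core] = data
        return data

    def write(self, core, new_value):
        if self.state[core] == I:       # fetch the line first on a write miss
            self.read(core)
        for c in range(len(self.state)):
            if c != core:               # invalidate every other copy
                self.state[c] = I
                self.value[c] = None
        self.state[core] = M
        self.value[core] = new_value

line = ToyLine(n_cores=6, memory_value=100)
for c in (0, 4, 5):                     # T1: cores 0, 4, 5 hold x = 100, Shared
    line.read(c)
line.write(0, 200)                      # T2: core 0 -> Modified, others Invalid
t3_value = line.read(4)                 # T3: core 4 read miss, core 0 M -> S
print(line.state, line.memory, t3_value)
```

In this model the state after T3 matches the description above: cores 0 and 4 are Shared, core 5 is still Invalid, and memory holds 200.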

T4: Core 5 needs to read x. Its copy of XCacheL was invalidated in T2, but at this moment the L3$ holds a correct copy of XCacheL. Does core 5 need to raise a read miss like core 4 did?

My guess is that it doesn't: since the L3$ holds a valid XCacheL, core 5 can fetch the line from the L3$ into its own L1$ without raising a read miss.

2
Where the L3 is inclusive, it is probably faster to read the shared lines from there. Where it isn't, they are forwarded from the other caches. That's why MESIF exists. The uncore probably just broadcasts the request on the QPI/UPI link, and either the L3, the iMC or another core's homing agent responds to it. If that's what you mean by a read miss (sorry, I lack the terminology), then a core will still fire it. Actually, you always need to fire something to read from outside the core, even from L1. – Margaret Bloom
Transitioning from Modified directly to Shared upon a read is not done on all processors. Sometimes it's good to invalidate instead, because the read will soon be followed by a write and you want the line exclusively. See software.intel.com/en-us/forums/… – Leeor

2 Answers

2
votes

It looks like you're talking about the Early Snoop algorithm where the caching agents of the L3 slices are responsible for sending snoops. So I'll answer the question according to that algorithm.

All Broadwell processors use an inclusive L3. So yes, core 5 will miss in its private L1 and L2 caches and a read request is sent to the caching agent of the L3 slice to which the requested line is mapped. The caching agent determines that it has the line and it is in the S state. Since it is a read request, the caching agent will send the cache line to core 5. The state of the line is not changed and no snoops are sent.
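A minimal sketch of that T4 lookup, assuming the situation after T3 (line Shared in the L3 slice, Invalid in core 5's private caches). `service_read` and its return values are hypothetical names for illustration, and only the case described above is modeled; misses that would need snoops or a memory access are deliberately left out.

```python
# Hypothetical model of a load under an inclusive L3; not real hardware behavior.
def service_read(core, private_state, l3_state):
    """Return (where the data came from, number of snoops sent)."""
    if private_state[core] in ("M", "E", "S"):
        return ("private L1/L2", 0)      # hit in the core's own caches
    if l3_state == "S":
        # Caching agent of the L3 slice has the line in S: send the data,
        # leave the L3 state unchanged, send no snoops.
        private_state[core] = "S"
        return ("L3 slice", 0)
    # Other L3 states would need snoops or a memory access (not modeled here).
    return ("not modeled", None)

private = {0: "S", 4: "S", 5: "I"}       # state after T3
result = service_read(5, private, "S")
print(result, private[5])
```

So core 5 does miss in its private caches, but the request is satisfied by the L3 slice alone.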

2
votes

You're right, in your T4 step, core #5's load will hit in L3, so no memory access happens. Core #5 gets another copy of the line, in Shared state.


Your sequence of steps makes zero sense for a CPU like Broadwell where all cores share access to on-chip DRAM controller(s).

A ring bus connects cores (each of which has a slice of L3 cache) and the System Agent (PCIe links and connection to other cores) and Home Agent (memory controllers). See https://en.wikichip.org/wiki/intel/microarchitectures/broadwell_(client)#Die_Stats for a block diagram showing the ring bus.

Individual cores don't directly drive "the memory bus", or even one of the 2 or 4 DRAM buses. The memory controller arbitrates access to DRAM, and has some buffering to reorder / combine accesses. (Everything that accesses memory goes through it, including DMA, so it can do whatever it likes as long as it gives the appearance of loads/stores happening in some sane order.)

A load request won't be sent to the system agent until after it misses in L3 cache. See https://superuser.com/questions/1226197/x86-address-space-controller/1226198#1226198 for an illustration of a quad-core desktop (which is simpler and just has the memory controller connected to the System Agent, making it exactly like a Northbridge before CPUs integrated the memory controllers.)


Since Broadwell uses an inclusive L3 cache, L3 tags can tell it which, if any, core has a Modified or Exclusive copy, even if the line in L3 itself isn't shareable. (i.e. a line's data can be Invalid in L3, but the tags are still tracking which core has a private copy). See "Which cache mapping technique is used in intel core i7 processor?"

This lets L3 tags act as a snoop filter to reduce broadcasts.
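As a rough illustration of the snoop-filter idea, assume the L3 tag keeps one presence bit per core next to each line (an assumption for this sketch; the exact bookkeeping in Broadwell may differ, and the bits may be conservative in real hardware). `cores_to_snoop` is a hypothetical helper name.

```python
# Sketch of a per-line presence bit vector used as a snoop filter.
# Bit c set means core c MAY hold the line in its private caches (assumed layout).
def cores_to_snoop(presence_bits, requester, n_cores):
    """Cores that must be snooped for this line, excluding the requester."""
    return [c for c in range(n_cores)
            if c != requester and (presence_bits >> c) & 1]

n_cores = 6
presence_after_t2 = 0b000001             # only core 0 holds the line (Modified)
filtered = cores_to_snoop(presence_after_t2, requester=4, n_cores=n_cores)
broadcast = [c for c in range(n_cores) if c != 4]   # what a broadcast would hit
print(filtered, broadcast)
```

For the T3 read from core 4, the filter narrows the snoop to core 0 only, instead of all five other cores.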