I understand that Intel uses a home-snooping coherency protocol with QPI, and perhaps something more complex/dynamic (workload-specific) with UPI. But suppose a cache line starts in the I (Invalid) state and no other core has it in its L1/L2. Once the line is requested from the home agent, will the load request also be broadcast to the other local cores? I believe it will. However, will the load request also be broadcast to cores on a different node?
Another possible explanation is this: if the line is not found in L2, the L3 (LLC) controller is asked for it. The LLC controller knows, via a directory, which node's DIMM or core holds the requested physical data, and routes the request to that node over QPI/UPI. Next, the target node's L3 controller broadcasts the request among the cores in the target node only. Finally, the requesting core's L2 controller is informed that the request went inter-node, so it does not broadcast to the other local cores. This would imply that requests are never broadcast beyond a single node.
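To make this second explanation concrete, here is a toy Python sketch of the flow I have in mind. Everything in it is hypothetical: `load_miss`, `cores_per_node`, and the dict-based `directory` are names I made up for illustration, and the model only encodes my assumption that snoops stay inside the home node. It is not a description of what Intel hardware actually does.

```python
def load_miss(cores_per_node, directory, req_node, req_core, line_addr):
    """Toy model of the hypothesized path of a load that missed in L1/L2.

    cores_per_node: dict mapping node id -> number of cores (hypothetical)
    directory:      dict mapping line address -> home node id (hypothetical)
    Returns (home_node, list of (node, core) pairs that receive a snoop).
    """
    # Step 1: the LLC controller consults the directory to find the home node.
    home = directory[line_addr]

    # Step 2: if the home node is remote, the request is routed there over
    # QPI/UPI. Assumption: no snoop is sent to the requester's local cores.

    # Step 3: the home node's L3 controller snoops only its own cores.
    snooped = []
    for core in range(cores_per_node[home]):
        if home == req_node and core == req_core:
            continue  # the requesting core does not snoop itself
        snooped.append((home, core))
    return home, snooped


if __name__ == "__main__":
    cores_per_node = {0: 4, 1: 4}
    directory = {0x1000: 1, 0x2000: 0}

    # Core 2 on node 0 loads a line whose home is node 1.
    print(load_miss(cores_per_node, directory, 0, 2, 0x1000))
    # Core 2 on node 0 loads a line whose home is node 0.
    print(load_miss(cores_per_node, directory, 0, 2, 0x2000))
```

Under this assumption, a load from node 0 to a line homed on node 1 snoops only the cores of node 1, and a load to a locally homed line snoops only the other cores of node 0. What I want to know is whether real hardware behaves like this, or whether snoops also reach cores outside the home node.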
I understand that this kind of information might not be publicly available, but any ideas are appreciated.