Real CPUs don't use a shared bus; traffic goes through an L3 cache whose tags work as a snoop filter (especially in single-socket Intel chips), or through similar traffic-saving mechanisms on other microarchitectures. You're right that actually broadcasting a message to every other core would be prohibitively expensive in power and performance as you scale to many cores. A shared bus is only a simple mental model for protocols like MESI, not the real implementation in modern CPUs. See What cache coherence solution do modern x86 CPUs use? for example.
Write-back caches with write-allocate need to read a cache line before storing into it, so they have the original data for the parts of the line you don't write. This read, when triggered by a write, is called a "read for ownership" (RFO) and gets the line into MESI Exclusive state (which can be converted to dirty Modified without external traffic). The RFO includes invalidating copies in other caches.
If the initial access was a read, the line typically still arrives in Exclusive state (just as it would from an RFO) as long as no other core has a cached copy, i.e. it missed in the L3 (last-level) cache. This means that traffic stays at a minimum for the common pattern of reading some private data and then modifying it.
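As a minimal sketch of that read-then-modify pattern (the variable and function names are made up purely for illustration):

```c
/* Hypothetical private counter: only one core ever touches it.
 * The first load misses in L3, so the line arrives in Exclusive state;
 * the store then flips it to Modified silently, with no RFO or
 * invalidation traffic while the line stays cached on this core. */
static long private_counter;

long bump(void)
{
    long old = private_counter;   /* read miss -> line cached in E state */
    private_counter = old + 1;    /* E -> M transition, no external traffic */
    return old;
}
```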
A multi-socket system would have to snoop the other socket or consult snoop filters to determine this, I think, but the most power/energy-sensitive systems are mobile (always single-socket).
Fun fact: Intel 2-socket Xeon chips before Skylake-X (e.g. E5 ...-v4) didn't have snoop filters for traffic between sockets, and did just spam snoops at the other socket across the QPI link. E7 CPUs (capable of being used in quad and larger systems) had dedicated snoop filter caches to track state of hot lines, as well as enough QPI links to cross-connect more sockets. source: John McCalpin's post on an Intel forum, although I haven't been able to find much other data. Perhaps John was thinking of earlier systems like Core2 / Nehalem Xeons where Intel does talk about having snoop filters, e.g.
https://www.intel.ca/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf compares QPI to their earlier setups, and has some more details about snooping modes that can trade off latency vs. throughput. Maybe Intel just doesn't use the term "snoop filter" the same way.
> Is there a way to do it the other way around, to indicate to the CPU that a given cache line will never be of interest to any other thread?
You can skip RFOs if you have a cache-write protocol that combines the store data with the invalidation. e.g. x86 has NT stores that bypass cache, and apparently fast-strings stores (`rep stos` / `rep movs`) even before ERMSB can also use a no-RFO write protocol (at least in P6, according to Andy Glew who designed it), even though they leave their data in the cache hierarchy. That does still require invalidation of other caches, though, unless this core already owns the lines in E or M state. See Enhanced REP MOVSB for memcpy.
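As a hedged sketch of the NT-store approach, here's a fill loop using real SSE2 intrinsics; the function name, buffer, and sizes are my own, and it assumes `dst` is 16-byte aligned and `bytes` is a multiple of 16:

```c
#include <emmintrin.h>   /* SSE2: _mm_set1_epi32, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Fill a buffer with NT (streaming) stores.  Full-line NT stores avoid
 * the RFO: they invalidate other copies and write-combine straight to
 * memory, bypassing this core's caches. */
void fill_nt(void *dst, size_t bytes, uint32_t value)
{
    __m128i v = _mm_set1_epi32((int)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(p + i, v);   /* movntdq: cache-bypassing store */
    _mm_sfence();   /* order the NT stores before later stores become visible */
}
```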
Some CPUs do have some scratchpad memory which is truly private to each core. It's not shared at all, so no explicit flushing is needed or possible. See Dr. Bandwidth's answer on Can you directly access the cache using assembly? - this is apparently common on DSPs.
But other than that, generally no, CPUs don't provide a way to treat parts of the memory address space as non-coherent. Coherency is a guarantee that CPUs don't want to let software disable. (Perhaps because it could create security problems, e.g. if some old writes could eventually become visible in a page of file data after an OS had checksummed it, but before DMA to disk, unprivileged user-space could cause a checksumming FS like BTRFS or ZFS to see bad blocks in a file it did `mmap(PROT_WRITE|PROT_READ, MAP_SHARED)` on.)
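For reference, that scenario assumes nothing more exotic than an ordinary shared writable file mapping, roughly like this sketch (the helper function and error handling are made up):

```c
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/* Hypothetical helper: an unprivileged process maps file data shared and
 * writable.  Stores through the returned pointer modify the same page-cache
 * pages the kernel will later checksum and write to disk, which is why
 * those stores have to stay coherent with everything else. */
void *map_file_shared(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping keeps the file referenced */
    return p;
}
```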
Normally memory barriers work by simply making the current core wait until the store buffer has drained into L1d cache (i.e. prior stores have become globally visible), so if you allowed non-coherent L1d then some other mechanism would be needed for flushing it. (e.g. x86 `clflush` or `clwb` to force write-back to outer caches.)
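As a rough sketch of what that explicit flushing looks like with real x86 intrinsics (the function name and 64-byte line-size assumption are mine; on current x86 this is only needed for things like persistent memory or non-coherent DMA, not for cross-thread visibility):

```c
#include <immintrin.h>   /* _mm_clwb (compile with -mclwb), _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Write any dirty cache lines covering [buf, buf+len) back to outer levels. */
void flush_range(const void *buf, size_t len)
{
    const char *p   = (const char *)((uintptr_t)buf & ~(uintptr_t)63);
    const char *end = (const char *)buf + len;
    for (; p < end; p += 64)
        _mm_clwb((void *)p);   /* write back; may leave the line cached */
    _mm_sfence();              /* order the write-backs before later stores */
}
```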
Creating ways for most software to take advantage of this would be hard; e.g. it's assumed that you can take the address of a local var and pass it to other threads. And even in a single-threaded program, any pointer might have come from `mmap(MAP_SHARED)`. So you can't default to mapping stack space as non-coherent or anything like that, and compiling programs to use extra flush instructions in case they get a pointer into non-coherent memory that does need to be visible after all would just totally defeat the purpose of the whole thing.
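A tiny example of the local-variable case (the thread function and variable names are made up): the address of something on one thread's stack legitimately escapes to another thread, so stack pages have to be just as coherent as anything else.

```c
#include <pthread.h>
#include <stdio.h>

static void *adder(void *arg)
{
    int *p = arg;
    *p += 1;            /* writes a variable that lives on main's stack */
    return NULL;
}

int main(void)
{
    int local = 41;     /* on main's stack */
    pthread_t t;
    pthread_create(&t, NULL, adder, &local);
    pthread_join(&t, NULL);      /* synchronizes with the other thread */
    printf("%d\n", local);       /* prints 42: the update must be visible */
    return 0;
}
```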
So part of the reason this isn't worth pursuing is all the extra complication that everything up the software stack would have to care about to make it efficient. Snoop filters and directory-based coherence are a sufficient solution to the problem, and overall much better than expecting everyone to optimize their code for this low-level feature!