Real CPUs don't use a shared bus; traffic goes through an L3 cache whose tags work as a snoop filter (especially in single-socket Intel chips), or through similar traffic-saving mechanisms on other microarchitectures. You're right that actually broadcasting a message to every other core would be prohibitively expensive in power and performance as you scale to many cores. A shared bus is only a simple mental model for protocols like MESI, not the real implementation in modern CPUs. See What cache coherence solution do modern x86 CPUs use? for example.
Write-back caches with write-allocate need to read a cache line before storing into it, so they have the original data for the parts of the line you don't write. This read, when triggered by a write, is called a "read for ownership" (RFO) and gets the line into MESI Exclusive state (which can be converted to dirty Modified without external traffic). The RFO includes invalidating copies in other caches.
If the initial access was a read, the line typically still arrives in Exclusive state (just as it would from an RFO) as long as no other core has a cached copy, i.e. it missed in the L3 (last-level) cache. This means that traffic stays at a minimum for the common pattern of reading some private data and then modifying it.
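As a minimal sketch of that read-then-modify pattern (the variable and function names are made up purely for illustration):

```c
/* Hypothetical private counter: only one core ever touches it.
 * The first load misses in L3, so the line arrives in Exclusive state;
 * the store then flips it to Modified silently, with no RFO or
 * invalidation traffic while the line stays cached on this core. */
static long private_counter;

long bump(void)
{
    long old = private_counter;   /* read miss -> line cached in E state */
    private_counter = old + 1;    /* E -> M transition, no external traffic */
    return old;
}
```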
A multi-socket system would have to snoop the other socket or consult snoop filters to determine this, I think, but the most power/energy-sensitive systems are mobile (always single-socket).
Fun fact: Intel 2-socket Xeon chips before Skylake-X (e.g. E5 ...-v4) didn't have snoop filters for traffic between sockets, and did just spam snoops at the other socket across the QPI link. E7 CPUs (capable of being used in quad and larger systems) had dedicated snoop filter caches to track state of hot lines, as well as enough QPI links to cross-connect more sockets. source: John McCalpin's post on an Intel forum, although I haven't been able to find much other data. Perhaps John was thinking of earlier systems like Core2 / Nehalem Xeons where Intel does talk about having snoop filters, e.g.
https://www.intel.ca/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf compares QPI to their earlier setups, and has some more details about snooping modes that can trade off latency vs. throughput. Maybe Intel just doesn't use the term "snoop filter" the same way.
> Is there a way to do it the other way around, to indicate to the CPU that a given cache line will never be of interest to any other thread?
You can skip RFOs if you have a cache-write protocol that combines the store data with the invalidation. e.g. x86 has NT stores that bypass cache, and apparently fast-strings stores (`rep stos` / `rep movs`) even before ERMSB can also use a no-RFO write protocol (at least in P6, according to Andy Glew who designed it), even though they leave their data in the cache hierarchy. That does still require invalidation of other caches, though, unless this core already owns the lines in E or M state. See Enhanced REP MOVSB for memcpy.
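As a hedged sketch of the NT-store approach, here's a fill loop using real SSE2 intrinsics; the function name, buffer, and sizes are my own, and it assumes `dst` is 16-byte aligned and `bytes` is a multiple of 16:

```c
#include <emmintrin.h>   /* SSE2: _mm_set1_epi32, _mm_stream_si128 */
#include <stddef.h>
#include <stdint.h>

/* Fill a buffer with NT (streaming) stores.  Full-line NT stores avoid
 * the RFO: they invalidate other copies and write-combine straight to
 * memory, bypassing this core's caches. */
void fill_nt(void *dst, size_t bytes, uint32_t value)
{
    __m128i v = _mm_set1_epi32((int)value);
    __m128i *p = (__m128i *)dst;
    for (size_t i = 0; i < bytes / 16; i++)
        _mm_stream_si128(p + i, v);   /* movntdq: cache-bypassing store */
    _mm_sfence();   /* order the NT stores before later stores become visible */
}
```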
Some CPUs do have some scratchpad memory which is truly private to each core. It's not shared at all, so no explicit flushing is needed or possible. See Dr. Bandwidth's answer on Can you directly access the cache using assembly? - this is apparently common on DSPs.
But other than that, generally no, CPUs don't provide a way to treat parts of the memory address space as non-coherent. Coherency is a guarantee that CPUs don't want to let software disable. (Perhaps because it could create security problems, e.g. if some old writes could eventually become visible in a page of file data after an OS had checksummed it, but before DMA to disk, unprivileged user-space could cause a checksumming FS like BTRFS or ZFS to see bad blocks in a file it did `mmap(PROT_WRITE|PROT_READ, MAP_SHARED)` on.)
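For reference, that scenario assumes nothing more exotic than an ordinary shared writable file mapping, roughly like this sketch (the helper function and error handling are made up):

```c
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stddef.h>

/* Hypothetical helper: an unprivileged process maps file data shared and
 * writable.  Stores through the returned pointer modify the same page-cache
 * pages the kernel will later checksum and write to disk, which is why
 * those stores have to stay coherent with everything else. */
void *map_file_shared(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);   /* the mapping keeps the file referenced */
    return p;
}
```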
Normally memory barriers work by simply making the current core wait until the store buffer has drained into L1d cache (i.e. prior stores have become globally visible), so if you allowed non-coherent L1d then some other mechanism would be needed for flushing it. (e.g. x86 `clflush` or `clwb` to force write-back to outer caches.)
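As a rough sketch of what that explicit flushing looks like with real x86 intrinsics (the function name and 64-byte line-size assumption are mine; on current x86 this is only needed for things like persistent memory or non-coherent DMA, not for cross-thread visibility):

```c
#include <immintrin.h>   /* _mm_clwb (compile with -mclwb), _mm_sfence */
#include <stddef.h>
#include <stdint.h>

/* Write any dirty cache lines covering [buf, buf+len) back to outer levels. */
void flush_range(const void *buf, size_t len)
{
    const char *p   = (const char *)((uintptr_t)buf & ~(uintptr_t)63);
    const char *end = (const char *)buf + len;
    for (; p < end; p += 64)
        _mm_clwb((void *)p);   /* write back; may leave the line cached */
    _mm_sfence();              /* order the write-backs before later stores */
}
```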
Creating ways for most software to take advantage of this would be hard; e.g. it's assumed that you can take the address of a local var and pass it to other threads. And even in a single-threaded program, any pointer might have come from `mmap(MAP_SHARED)`. So you can't default to mapping stack space as non-coherent or anything like that, and compiling programs to use extra flush instructions in case they get a pointer into non-coherent memory that does need to be visible after all would just totally defeat the purpose of the whole thing.
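A tiny example of the local-variable case (the thread function and variable names are made up): the address of something on one thread's stack legitimately escapes to another thread, so stack pages have to be just as coherent as anything else.

```c
#include <pthread.h>
#include <stdio.h>

static void *adder(void *arg)
{
    int *p = arg;
    *p += 1;            /* writes a variable that lives on main's stack */
    return NULL;
}

int main(void)
{
    int local = 41;     /* on main's stack */
    pthread_t t;
    pthread_create(&t, NULL, adder, &local);
    pthread_join(&t, NULL);      /* synchronizes with the other thread */
    printf("%d\n", local);       /* prints 42: the update must be visible */
    return 0;
}
```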
So part of the reason this isn't worth pursuing is all the extra complication that everything up the software stack would have to care about to make it efficient. Snoop filters and directory-based coherence are a sufficient solution to the problem, and overall much better than expecting everyone to optimize their code for this low-level feature!