12
votes

This question is specifically aimed at modern x86-64 cache coherent architectures - I appreciate the answer can be different on other CPUs.

If I write to memory, the MESI protocol requires that the cache line is first read into cache, then modified in the cache (the value is written to the cache line, which is then marked dirty). In older write-through micro-architectures, this would then trigger the cache line being flushed; under write-back, the flush can be delayed for some time, and some write combining can occur under both mechanisms (more likely with write-back). And I know how this interacts with other cores accessing the same cache line of data - cache snooping etc.

My question is, if the store matches precisely the value already in the cache, if not a single bit is flipped, does any Intel micro-architecture notice this and NOT mark the line as dirty, and thereby possibly save the line from being marked as exclusive, and the writeback memory overhead that would at some point follow?

As I vectorise more of my loops, my vectorised-operations compositional primitives don't explicitly check for values changing, and to do so in the CPU/ALU seems wasteful, but I was wondering if the underlying cache circuitry could do it without explicit coding (eg the store micro-op or the cache logic itself). As shared memory bandwidth across multiple cores becomes more of a resource bottleneck, this would seem like an increasingly useful optimisation (eg repeated zero-ing of the same memory buffer - we don't re-read the values from RAM if they're already in cache, but to force a writeback of the same values seems wasteful). Writeback caching is itself an acknowledgement of this sort of issue.

Can I politely request holding back on "in theory" or "it really doesn't matter" answers - I know how the memory model works, what I'm looking for is hard facts about how writing the same value (as opposed to avoiding a store) will affect the contention for the memory bus on what you may safely assume is a machine running multiple workloads that are nearly always bound by memory bandwidth. On the other hand an explanation of precise reasons why chips don't do this (I'm pessimistically assuming they don't) would be enlightening...

Update: Some answers along the expected lines here https://softwareengineering.stackexchange.com/questions/302705/are-there-cpus-that-perform-this-possible-l1-cache-write-optimization but still an awful lot of speculation "it must be hard because it isn't done" and saying how doing this in the main CPU core would be expensive (but I still wonder why it can't be a part of the actual cache logic itself).

Update (2020): Travis Downs has found evidence of Hardware Store Elimination but only, it seems, for zeros and only where the data misses L1 and L2, and even then, not in all cases. His article is highly recommended as it goes into much more detail.... https://travisdowns.github.io/blog/2020/05/13/intel-zero-opt.html

Update (2021): Travis Downs has now found evidence that this zero store optimisation has recently been disabled in microcode... more detail as ever from the source himself https://travisdowns.github.io/blog/2021/06/17/rip-zero-opt.html

3
The answers on softwareengineering.stackexchange.com/questions/302705/… are mostly terrible, especially the currently accepted one, which shows a lack of understanding of caches / CPU registers. - Peter Cordes

3 Answers

7
votes

Currently no implementation of x86 (or any other ISA, as far as I know) supports optimizing silent stores.

There has been academic research on this and there is even a patent on "eliminating silent store invalidation propagation in shared memory cache coherency protocols". (Google '"silent store" cache' if you are interested in more.)

For x86, this would interfere with MONITOR/MWAIT; some users might want the monitoring thread to wake on a silent store (one could avoid invalidation and add a "touched" coherence message). (Currently MONITOR/MWAIT is privileged, but that might change in the future.)

Similarly, silent store elimination could interfere with some clever uses of transactional memory, for example when a memory location is used as a guard to avoid explicitly loading other memory locations or, on an architecture that supports it (as AMD's Advanced Synchronization Facility did), to drop the guarded memory locations from the read set.

(Hardware Lock Elision is a very constrained implementation of silent ABA store elimination. It has the implementation advantage that the check for value consistency is explicitly requested.)

There are also implementation issues in terms of performance impact/design complexity. Such would prohibit avoiding read-for-ownership (unless the silent store elimination was only active when the cache line was already present in shared state), though read-for-ownership avoidance is also currently not implemented.

Special handling for silent stores would also complicate implementation of a memory consistency model (probably especially x86's relatively strong model). Such might also increase the frequency of rollbacks on speculation that failed consistency. If silent stores were only supported for L1-present lines, the time window would be very small and rollbacks extremely rare; stores to cache lines in L3 or memory might increase the frequency to very rare, which might make it a noticeable issue.

Silence at cache line granularity is also less common than silence at the access level, so the number of invalidations avoided would be smaller.
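To make the granularity point concrete, here is a rough sketch that replays a store trace and counts silence both per store and per cache line. The `Store` and `SilenceCount` types and the `count_silence` function are hypothetical names invented for this illustration, and "silent line" is simplified to mean "every store to the line was silent" (a line that is modified and later restored would also be silent at write-back, which this does not track).

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical trace entry: a 32-bit store to word `index` of a flat memory image.
struct Store { std::size_t index; std::uint32_t value; };

struct SilenceCount {
    std::size_t silent_stores = 0;  // stores that wrote the value already present
    std::size_t silent_lines = 0;   // touched lines in which every store was silent
};

// Replay `trace` over a copy of `mem` and count silence at both granularities.
// words_per_line = 16 models 64-byte lines holding 32-bit words.
inline SilenceCount count_silence(std::vector<std::uint32_t> mem,
                                  const std::vector<Store>& trace,
                                  std::size_t words_per_line = 16) {
    SilenceCount out;
    std::set<std::size_t> touched, dirtied;
    for (const Store& s : trace) {
        const std::size_t line = s.index / words_per_line;
        touched.insert(line);
        if (mem.at(s.index) == s.value) {
            ++out.silent_stores;        // access-level silence
        } else {
            mem.at(s.index) = s.value;  // a single changed word dirties the line
            dirtied.insert(line);
        }
    }
    out.silent_lines = touched.size() - dirtied.size();
    return out;
}
```

One non-silent store among many silent ones is enough to make the whole line dirty, which is why the line-level count can only be as good as, and is usually worse than, the store-level count suggests.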

The additional cache bandwidth would also be an issue. Currently Intel uses only parity (rather than ECC) on L1 caches to avoid the need for read-modify-write on small writes. Requiring every write to have a read in order to detect silent stores would have obvious performance and power implications. (Such reads could be limited to shared cache lines and be performed opportunistically, exploiting cycles without full cache access utilization, but that would still have a power cost.) This also means that this cost would fall out if read-modify-write support was already present for L1 ECC support (which feature would please some users).

I am not well-read on silent store elimination, so there are probably other issues (and workarounds).

With much of the low-hanging fruit for performance improvement having been taken, more difficult, less beneficial, and less general optimizations become more attractive. Since silent store optimization becomes more important with higher inter-core communication and inter-core communication will increase as more cores are utilized to work on a single task, the value of such seems likely to increase.

6
votes

It's possible to implement in hardware, but I don't think anybody does. Doing it for every store would either cost cache-read bandwidth or require an extra read port and make pipelining harder.

You'd build a cache that did a read/compare/write cycle instead of just write, and could conditionally leave the line in Exclusive state instead of Modified (of MESI). Doing it this way (instead of checking while it was still Shared) would still invalidate other copies of the line, but that means there's no interaction with memory-ordering. The (silent) store becomes globally visible while the core has Exclusive ownership of the cache line, same as if it had flipped to Modified and then back to Exclusive by doing a write-back to DRAM.

The read/compare/write has to be done atomically (you can't lose the cache line between the read and the write; if that happened the compare result would be stale). This makes it harder to pipeline data committing to L1D from the store queue.
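As a way to picture the read/compare/write idea, here is a toy software model of a single cache line. It is only a sketch under heavy simplifying assumptions (one 64-bit word per line, no timing, no other cores actually modelled), and `Mesi`, `ToyLine` and its members are hypothetical names, not anything a real CPU exposes.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of one cache line holding a single 64-bit word.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

struct ToyLine {
    Mesi state = Mesi::Invalid;
    std::uint64_t data = 0;
    int invalidations_sent = 0;  // ownership requests that invalidated other copies

    // Simulate committing a store with a read/compare/write step.
    // memory_value is what a read-for-ownership would fetch on a miss.
    void store(std::uint64_t value, std::uint64_t memory_value) {
        if (state == Mesi::Invalid || state == Mesi::Shared) {
            // Ownership is acquired first, exactly as for a normal store,
            // so other copies are invalidated even if the store turns out silent.
            if (state == Mesi::Invalid) data = memory_value;
            ++invalidations_sent;
            state = Mesi::Exclusive;
        }
        // Read/compare/write, treated as one indivisible step in this model.
        if (data != value) {
            data = value;
            state = Mesi::Modified;  // dirty: needs a write-back later
        }
        // Otherwise the state is kept: Exclusive stays clean, Modified stays dirty.
        assert(state == Mesi::Exclusive || state == Mesi::Modified);
    }

    bool needs_writeback() const { return state == Mesi::Modified; }
};
```

In this model a silent store to a line that was Invalid or Shared still costs an invalidation, but leaves the line clean in Exclusive state, so no write-back is owed. That matches the ordering argument above: the only thing saved is the eventual write-back.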


In a multi-threaded program, it can be worth doing this as an optimization in software for shared variables only.

Avoiding invalidating everyone else's cache can make it worth converting

shared = x;

into

if(shared != x)
    shared = x;

I'm not sure if there are memory-ordering implications here. Obviously if the shared = x never happens, there's no release-sequence, so you only have acquire semantics instead of release. But if the value you're storing is often what's already there, any use of it for ordering other things will have ABA problems.
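For what it's worth, a minimal sketch of the check-before-store idea with `std::atomic` might look like the following. The helper name `store_if_changed` is made up for this example, and the relaxed memory order is an assumption that the variable is a plain flag or value and is not used to publish other data; if it is, the ordering caveats above apply and this sketch is not enough.

```cpp
#include <atomic>

// Hypothetical helper: store x only if it differs from the current value.
// Returns true if a store was actually performed.
inline bool store_if_changed(std::atomic<int>& shared, int x) {
    // The load can be satisfied from a line in Shared state, so it does not
    // invalidate copies of the line held by other cores.
    if (shared.load(std::memory_order_relaxed) != x) {
        // Only this path requests ownership of the line.
        shared.store(x, std::memory_order_relaxed);
        return true;
    }
    return false;
}
```

Note that the load and the store are two separate operations: another thread can write in between, so this is only appropriate where a plain `shared = x;` would have been acceptable anyway.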

IIRC, Herb Sutter mentions this potential optimization in part 1 or 2 of his atomic<> Weapons: The C++ Memory Model and Modern Hardware talk. (A couple of hours of video.)

This is of course too expensive to do in software for anything other than shared variables where the cost of writing them is many cycles of delay in other threads (cache misses and memory-order mis-speculation machine clears: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)


Related: See this answer for more about x86 memory bandwidth in general, especially the NT vs. non-NT store stuff, and "latency bound platforms" for why single-threaded memory bandwidth on many-core Xeons is lower than on a quad-core, even though aggregate bandwidth from multiple cores is higher.

5
votes

I find evidence that some modern x86 CPUs from Intel, including Skylake and Ice Lake client chips, can optimize redundant (silent) stores in at least one specific case:

  • An all zero cache line is overwritten fully or partially with more zeros.

That is, a "zeros over zeros" scenario.

For example, this chart shows the performance (the circles, measured on the left axis) and relevant performance counters for a scenario where a region of varying size is filled with 32-bit values of either zero or one, on Ice Lake:

[Figure: Ice Lake Fill Performance]

Once the region no longer fits in the L2 cache, there is a clear advantage for writing zeros: the fill throughput is almost 1.5x higher. In the case of zeros, we also see that the evictions from L2 are almost all "silent", indicating that no dirty data needed to be written out, while in the other case all evictions are non-silent.
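A stripped-down version of this kind of fill test could look like the sketch below. It is not the benchmark that produced the chart: `timed_fill` and `fill_gbps` are hypothetical names, the empty `asm volatile` statement is a GCC/Clang-specific compiler barrier, and wall-clock timing alone cannot show whether evictions were silent (that needs the L2 performance counters).

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fill `buf` with one 32-bit value `passes` times and return the elapsed seconds.
inline double timed_fill(std::vector<std::uint32_t>& buf, std::uint32_t value, int passes) {
    const auto start = std::chrono::steady_clock::now();
    for (int p = 0; p < passes; ++p) {
        for (std::size_t i = 0; i < buf.size(); ++i) buf[i] = value;
        // Compiler barrier (GCC/Clang extension) so the passes are not merged or dropped.
        asm volatile("" : : "r"(buf.data()) : "memory");
    }
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Rough fill throughput in GB/s for a region of `bytes` bytes.
inline double fill_gbps(std::size_t bytes, std::uint32_t value, int passes) {
    std::vector<std::uint32_t> buf(bytes / sizeof(std::uint32_t), 0);  // starts all-zero
    timed_fill(buf, value, 1);  // warm-up pass so pages are mapped before timing
    const double secs = timed_fill(buf, value, passes);
    return static_cast<double>(bytes) * passes / secs / 1e9;
}
```

To look for the effect, one would compare `fill_gbps(size, 0, n)` against `fill_gbps(size, 1, n)` across region sizes from well inside L1 to well beyond L2, ideally pinned to one core and repeated several times.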

Some miscellaneous details about this optimization:

  • It optimizes the write-back of the dirty cache line, not the RFO which still needs to occur (indeed, the read is probably needed to decide that the optimization can be applied).
  • It seems to occur around the L2 or L2 <-> L3 interface. That is, I don't find evidence of this optimization for workloads that fit in L1 or L2.
  • Because the optimization takes effect at some point outside the innermost layer of the cache hierarchy, it is not necessary to write only zeros to take advantage: it is enough that the line contains all zeros at the point it is written back to the L3. So starting with an all-zero line, you can do any amount of non-zero writes, followed by a final zero-write of the entire line[1], as long as the line does not escape to the L3 in the meantime.
  • The optimization has varying performance effects: sometimes the relevant performance counters show that the optimization is occurring, but there is almost no increase in throughput. Other times the impact can be very large.
  • I don't find evidence of the effect in Skylake server or earlier Intel chips.

I wrote this up in more detail here, and there is an addendum here for Ice Lake, which exhibits this effect more strongly.

Update, June 2021: This optimization has been disabled in the newest CPU microcode versions provided by Intel, for security reasons (details).


[1] Or, at least, overwrite the non-zero parts of the line with zeros.