27 votes

I'm trying to decide between two algorithms. One writes 8 bytes (two aligned 4-byte words) to 2 cache lines, the other writes 3 entire cache lines.

If the CPU writes only the changed 8 bytes back to memory, then the first algorithm uses much less memory bandwidth: 8 bytes vs 192 bytes. If the CPU writes entire cache lines, then the difference between 128 and 192 bytes is less striking.

So how does an Intel Xeon CPU write back to memory? You'd be surprised how hard it is to find an answer on Google to something that should be well known.

As I understand it, the writes go into the store buffer, and then to the cache. They might only get written to memory when the dirty cache line is evicted from the cache, but does Intel track which parts of the cache line are dirty, or just dump the entire thing? I rather doubt that they track things below cache line granularity. I would also be very surprised if anything goes to memory before the cache line is evicted.

Why the downvotes? This is a great question! – Stephan Dollberg
@inf One problem with the question is that the term "Intel Xeon CPU" doesn't make a useful distinction here. The Xeon trademark has been applied to Intel x86 CPUs since the Pentium II era. Technically it doesn't denote a different kind of processor so much as a different kind of customer the processors are marketed towards. By limiting the question to just "enterprise-class" CPUs, it's less useful than one that asked about Intel x86 CPUs generally. The answer is going to be the same either way. – Ross Ridge
@RossRidge Well, then simply ask for clarification as to which architecture he's referring to, and don't go on a downvote spree. – Stephan Dollberg
It seems that your main goal is to decide between two algorithms (based on performance). Is there a good reason not to just benchmark both algorithms? It might be a bit more work, but it's guaranteed to give you exactly the information you need to make the choice (a minimal harness sketch follows these comments). – Hugh Allen
@RossRidge I'm not old enough to remember Xeons based on the Pentium II. I'll restrict my interest to Sandy Bridge and newer CPUs, since in a cloud-services world that's about as old as you'll find. I used Xeon in the title because more people know what a Xeon is than what a Sandy Bridge is. – Eloff
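
Following up on Hugh Allen's suggestion, here is a minimal harness sketch. The candidate_a/candidate_b functions are invented stand-ins shaped like the two patterns in the question, and the empty asm statement is a GCC/Clang-style compiler barrier so the stores aren't optimized away:

    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    alignas(64) static uint8_t buf[3 * 64];   // three cache lines of scratch

    // Keep the compiler from eliminating the stores (GCC/Clang syntax).
    static void clobber() { asm volatile("" : : "r"(buf) : "memory"); }

    static void candidate_a() {               // two aligned 4-byte words, 2 lines
        uint32_t x = 1, y = 2;
        std::memcpy(buf + 60, &x, 4);         // last word of line 0
        std::memcpy(buf + 64, &y, 4);         // first word of line 1
        clobber();
    }

    static void candidate_b() {               // three whole lines (192 bytes)
        std::memset(buf, 0, sizeof buf);
        clobber();
    }

    template <typename F>
    static double seconds(F f, int iters) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < iters; ++i) f();
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        const int iters = 10000000;
        std::printf("A (8 bytes, 2 lines):   %.3f s\n", seconds(candidate_a, iters));
        std::printf("B (192 bytes, 3 lines): %.3f s\n", seconds(candidate_b, iters));
    }

As written this keeps the working set in cache, so it mostly measures store throughput; to expose the memory-bandwidth difference the question is about, the buffer would need to be much larger than the last-level cache.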

2 Answers

18 votes

Locality matters even for DRAM itself, even discounting caching. A burst write of 64 contiguous bytes for a dirty cache line is a lot faster than 16 writes of 4 bytes to 16 different addresses. Or to put it another way, writing back an entire cache line is not much slower than writing back just a few changed bytes in a cache line.

What Every Programmer Should Know About Memory, by Ulrich Drepper, explains a lot of stuff about avoiding memory bottlenecks when programming. He includes some details of DRAM addressing. DRAM controllers have to select a row, and then select a column. Accessing another virtual memory page can also cause a TLB miss.
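
To make the row/column point concrete, here is a toy decoder for turning a physical address into DRAM coordinates. The field widths are invented for illustration; real controllers interleave channels, ranks, and banks in model-specific ways:

    #include <cstdint>
    #include <cstdio>

    struct DramAddr { uint64_t column, row, bank; };

    // Toy bit layout: 10 column bits, 16 row bits, 3 bank bits.
    // Opening a new row (a "row miss") is the expensive step.
    static DramAddr decode(uint64_t phys) {
        return { phys & 0x3FF,            // column within the open row
                 (phys >> 10) & 0xFFFF,   // row
                 (phys >> 26) & 0x7 };    // bank
    }

    int main() {
        // Two writes 4 bytes apart land in the same row; a write
        // 1 MiB away lands in a different row of (here) the same bank.
        DramAddr a = decode(0x100000), b = decode(0x100004), c = decode(0x200000);
        std::printf("rows: a=%llu b=%llu c=%llu\n",
                    (unsigned long long)a.row, (unsigned long long)b.row,
                    (unsigned long long)c.row);
    }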

DRAM does have a burst-transfer command for transferring a sequential chunk of data. (Obviously designed for the benefit of CPUs writing back cache lines). The memory system in modern computers is optimized for the usage-pattern of writing whole cache lines, because that's what almost always happens.

Cache lines are the unit at which CPUs track dirty-or-not. It would be possible to track dirtiness at a granularity smaller than a whole line, but that would take extra transistors and isn't worth it. The multiple levels of cache are set up to transfer whole cache lines around, so they can be as fast as possible when a whole cache line needs to be read.

There are so-called non-temporal reads/writes (movntdqa for loads, movnti for stores) that bypass the cache. These are for use with data that won't be touched again until it would have been evicted from the cache anyway (hence "non-temporal"). They are a bad idea for data that could benefit from caching, but would let you write 4 bytes to memory rather than a whole cache line. Depending on the MTRR for that memory range, the write might or might not be subject to write-combining. (This is relevant for memory-mapped I/O regions, where two adjacent 4-byte writes aren't the same as one 8-byte write.)
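
From C or C++ these instructions are reachable through intrinsics rather than raw assembly. A small sketch with invented wrapper names (movnti is _mm_stream_si32; movntdqa is the non-temporal load, _mm_stream_load_si128, which only behaves specially on write-combining memory):

    #include <emmintrin.h>   // SSE2: _mm_stream_si32 (movnti), _mm_stream_si128 (movntdq)
    #include <smmintrin.h>   // SSE4.1: _mm_stream_load_si128 (movntdqa)

    // 4-byte store with a non-temporal hint: goes out through a
    // write-combining buffer without pulling the line into the cache.
    static void nt_store32(int* dst, int value) {
        _mm_stream_si32(dst, value);
    }

    // Non-temporal 16-byte store; dst must be 16-byte aligned.
    static void nt_store128(__m128i* dst, __m128i value) {
        _mm_stream_si128(dst, value);
    }

    // Non-temporal 16-byte load (movntdqa); src must be 16-byte aligned.
    static __m128i nt_load128(__m128i* src) {
        return _mm_stream_load_si128(src);
    }

    // NT stores are weakly ordered: fence before other threads or
    // devices are allowed to observe the data.
    static void nt_publish() {
        _mm_sfence();
    }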

The algorithm that only touches two cache lines certainly has the advantage on that score, unless it takes a lot more computation, or especially branching, to figure out which memory to write. Maybe ask a different question if you want help deciding. (see the links at https://stackoverflow.com/tags/x86/info, esp Agner Fog's guides, for info that will help you decide for yourself.)

See Cornstalks' answer for warnings about the dangers of having multiple threads on different cores touching the same memory. That contention can cause bigger slowdowns than the extra writes cost a single-threaded program.

9 votes

In order for the CPU to write only the dirty bytes back to memory, it would need to store a dirty bit for every byte in the cache. That would be prohibitively expensive and isn't done on modern CPUs (as far as I know): there is just one dirty bit per cache line. Writing to any byte in the cache line causes the whole line to be marked as dirty.

When it comes time to flush the dirty cache line, the whole line needs to be written, because the CPU has no idea which byte(s) changed.
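
A toy model of that bookkeeping makes the granularity concrete (an illustration of the policy, of course, not of how the hardware is built):

    #include <cstdint>
    #include <cstring>

    // Toy write-back cache line: a single dirty bit covers all 64 bytes.
    struct CacheLine {
        uint64_t tag   = 0;      // which 64-byte block of memory is cached
        bool     valid = false;
        bool     dirty = false;  // the only write-back state that is kept
        uint8_t  data[64];
    };

    // Writing any one byte marks the entire line dirty.
    static void write_byte(CacheLine& line, unsigned offset, uint8_t value) {
        line.data[offset & 63] = value;
        line.dirty = true;
    }

    // On eviction there is no record of which bytes changed,
    // so the whole 64-byte line is written back.
    static void evict(CacheLine& line, uint8_t* memory) {
        if (line.valid && line.dirty)
            std::memcpy(memory + line.tag * 64, line.data, 64);
        line.valid = line.dirty = false;
    }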

This interacts with cache coherence between cores: when one core writes to its copy of a cache line, the copy of that same line in another core is invalidated, even if the first core is only using the low half of the line and the second core only the high half. That is, if core 1 writes to byte N and core 2 is using byte N+1, core 2 still has to re-fetch the whole line, even though you and I know that it isn't necessary. This is the root of the "false sharing" problem.
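
A minimal sketch of that false-sharing effect, assuming 64-byte lines; on a typical multicore x86 the same-line version runs several times slower than the padded one:

    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <cstdio>
    #include <thread>

    // Both counters share one 64-byte line, so each increment
    // invalidates the other core's copy and the line ping-pongs.
    struct SameLine {
        std::atomic<uint64_t> a{0}, b{0};
    };

    // Each counter padded out to its own 64-byte line.
    struct SeparateLines {
        alignas(64) std::atomic<uint64_t> a{0};
        alignas(64) std::atomic<uint64_t> b{0};
    };

    template <typename T>
    static double run(T& s) {
        auto t0 = std::chrono::steady_clock::now();
        std::thread t1([&] { for (int i = 0; i < 20000000; ++i)
                                 s.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (int i = 0; i < 20000000; ++i)
                                 s.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join();
        t2.join();
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        SameLine shared;
        SeparateLines padded;
        std::printf("same line:      %.3f s\n", run(shared));
        std::printf("separate lines: %.3f s\n", run(padded));
    }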