0 votes

Note: I'm not sure if StackOverflow is the correct place for this question or if there is a more suitable StackExchange site for it.

I've read in a book that for multilevel CPU caches, the cache line size increases with each level's total memory size. I can totally understand how this works (or at least I think so) when used with quite simple architectures. Then I came across this question. The question is: how can cache levels with the same cache line size cooperate?

This is how I perceive the way caches with different cache line sizes work. For simplicity, let's suppose there are no separate caches for data and instructions, and we only have L1 and L2 caches (no L3 or L4). If L1 has a cache line size of 64 bytes and L2 of 128 bytes, then when we have a cache miss on L2 and need to fetch the desired byte or word from main memory, we also bring in its closest bytes or words in order to fill the 128 bytes of the L2 cache line. Then, because of the locality of the references to memory locations produced by the processor, we have a higher probability of getting a hit on L2 when missing on L1. But if we had equal cache line sizes, this of course wouldn't happen with the previous algorithm. Can you explain some simple algorithm or implementation of how modern CPUs take advantage of caches having the same line size?
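To make my mental model concrete, here's a small C sketch of which byte ranges each level would fill on a miss under that scheme (the 64B/128B sizes are just my hypothetical numbers from above, and `line_base` is a helper I made up):

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical line sizes from the scenario above: 64B L1 lines, 128B L2 lines. */
#define L1_LINE 64u
#define L2_LINE 128u

/* Round a byte address down to the start of the cache line containing it. */
static uint64_t line_base(uint64_t addr, uint64_t line_size) {
    return addr & ~(line_size - 1);   /* works because line sizes are powers of two */
}

int main(void) {
    uint64_t addr = 0x12345;          /* arbitrary example address */

    /* On an L2 miss, the whole 128B block around addr is fetched from DRAM;
       the L1 then fills only the 64B half that contains addr.               */
    printf("L2 fills [%#llx, %#llx)\n",
           (unsigned long long)line_base(addr, L2_LINE),
           (unsigned long long)(line_base(addr, L2_LINE) + L2_LINE));
    printf("L1 fills [%#llx, %#llx)\n",
           (unsigned long long)line_base(addr, L1_LINE),
           (unsigned long long)(line_base(addr, L1_LINE) + L1_LINE));
    return 0;
}
```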

Thanks in advance.


1 Answer

1 vote

I've read in a book that for multilevel CPU caches, the cache line size increases with each level's total memory size.

That's not true for most CPUs. Usually the line size is the same in all levels of cache, but the total size increases. Often the associativity increases too, but usually not by as much as the total size, so the number of sets typically increases as well.
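As a rough illustration (the sizes and associativities below are plausible-looking numbers I picked, not any specific CPU's real geometry), with a fixed 64B line it's the set count that grows along with the total size:

```c
#include <stdio.h>

/* sets = total_size / (associativity * line_size).
   The sizes and associativities below are illustrative only. */
static unsigned sets(unsigned total_bytes, unsigned ways, unsigned line_bytes) {
    return total_bytes / (ways * line_bytes);
}

int main(void) {
    /* Same 64B line everywhere; only size (and a little associativity) grows. */
    printf("L1: %u sets\n", sets(32 * 1024,       8,  64));   /*   64 sets */
    printf("L2: %u sets\n", sets(256 * 1024,      8,  64));   /*  512 sets */
    printf("L3: %u sets\n", sets(8 * 1024 * 1024, 16, 64));   /* 8192 sets */
    return 0;
}
```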

The point of multi-level caches is to get low latency and large size without needing a single cache that's both large and low latency (because that's physically impossible).

HW prefetch into L2 and/or L1 is what makes sequential reads work well, not a larger line size in outer levels of cache. (And in multi-core CPUs, private L1/L2 + shared L3 means the private caches act as latency and bandwidth filters before memory traffic hits the shared domain, and you then have L3 as a coherency backstop instead of hitting DRAM for data that's shared between cores.)
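To illustrate why prefetch, rather than a bigger outer-level line, is what helps sequential reads, here's a toy next-line prefetcher in C. Everything here (the direct-mapped geometry, the names) is made up for illustration; real prefetchers are hardware that tracks strides and multiple streams:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define LINE      64u
#define NUM_SETS  64u          /* toy direct-mapped cache; geometry is made up */

/* One tag per set; 'valid' says whether the set holds anything. */
static struct { uint64_t tag; bool valid; } cache[NUM_SETS];

static bool lookup(uint64_t line_addr) {
    unsigned set = (unsigned)(line_addr / LINE) % NUM_SETS;
    return cache[set].valid && cache[set].tag == line_addr;
}

static void fill(uint64_t line_addr) {
    unsigned set = (unsigned)(line_addr / LINE) % NUM_SETS;
    cache[set].tag = line_addr;
    cache[set].valid = true;
}

/* Toy next-line prefetcher: on a demand miss to line N, also fill line N+1. */
static void access_byte(uint64_t addr) {
    uint64_t line = addr & ~(uint64_t)(LINE - 1);
    if (!lookup(line)) {
        fill(line);                 /* demand fill            */
        fill(line + LINE);          /* prefetch the next line */
        printf("miss  %#llx (prefetched %#llx)\n",
               (unsigned long long)line, (unsigned long long)(line + LINE));
    } else {
        printf("hit   %#llx\n", (unsigned long long)line);
    }
}

int main(void) {
    /* Sequential reads: every other line is already present thanks to the prefetch. */
    for (uint64_t a = 0; a < 4 * LINE; a += LINE)
        access_byte(a);
    return 0;
}
```

On a sequential address stream, half the accesses hit because the previous miss already pulled in the next line, which is the same benefit a doubled line size would give, but the prefetcher can keep running ahead and adapt to the access pattern.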


Having different line sizes in different caches would be more complicated, especially in a multi-core system where caches have to maintain coherency with each other using MESI. Transferring whole cache lines between caches works well when every cache agrees on what a line is.

But if L1D lines are 64B and private L2 / shared L3 lines are 128B, then a load on one core might force the L2 cache to request both halves separately, in case separate cores each had one of the two halves of the 128B line modified. That sounds really complicated, and puts more logic into the outer-level cache.

(Paul Clayton's answer on the question you linked points out that a possible solution to that problem is separate validity bits for the two halves of a larger cache line, or even separate MESI coherency state. But they would still share the same tag, so if both halves are valid they have to be caching the two halves of the same 128B block.)
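A rough C sketch of what that sectored-line layout might look like (the names and field sizes are mine, purely illustrative, not how any particular CPU lays it out):

```c
#include <stdint.h>

/* MESI states for a (half-)line. */
enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

/* Sketch of the "sectored line" idea from Paul Clayton's answer:
   one tag covers a 128B block, but each 64B half ("sector") keeps its
   own valid/MESI state, so the halves can be fetched and owned
   independently while still being halves of the same aligned block.   */
struct sectored_line {
    uint64_t  tag;            /* tag for the whole 128B block   */
    enum mesi state[2];       /* per-64B-sector coherency state */
    uint8_t   data[2][64];    /* the two 64B sectors            */
};
```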