4 votes

I need to optimize a set of algorithms based on in-memory tables for a certain processor. I found myself wondering why every Intel processor has used 64KB of L1 cache (32KB data + 32KB instruction) per core since at least 2010.

Why do they stick with 64KB even though every other cache level has grown and almost gigantic L3 caches have been introduced?

Is there anything I can read about this?

Is there a reasonable guess as to whether this will ever increase within the next 5 or 10 years?

I checked other vendors: Opterons, for instance, come with 64KB + 64KB, but that is shared per module, and Interlagos (for instance) has just 16KB of L1 data cache per core with a 64KB instruction cache shared per module. Apple's A7 and A8 have 64KB + 64KB per core, but other vendors' 64-bit ARM cores use 16KB + 16KB.

Currently I design with 8KB tables, but once I have to mix two tables together this becomes even more important.
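
To make the footprint concrete, here is a minimal sketch of the arithmetic (assuming the common 32KB, 64-byte-line Intel L1d and the 8KB table size mentioned above; the numbers are purely illustrative):

```c
#include <stdio.h>

/* Illustrative numbers: a 32 KiB L1d with 64-byte lines (the common
 * Intel configuration) and two of the 8 KiB tables mentioned above. */
#define L1D_BYTES   (32 * 1024)
#define LINE_BYTES  64
#define TABLE_BYTES (8 * 1024)

int main(void) {
    unsigned footprint = 2 * TABLE_BYTES;   /* mixing two tables */
    printf("tables use %u of %u L1d lines (%u%% of the cache)\n",
           footprint / LINE_BYTES, L1D_BYTES / LINE_BYTES,
           footprint * 100 / L1D_BYTES);    /* 256 of 512 lines, 50% */
    return 0;
}
```

So two 8KB tables alone would already occupy half the L1d, leaving the other half for everything else the code touches.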

AFAIK, Intel CPUs have virtually indexed L1 caches, for speed (you can do the cache access in parallel with the page-table lookup). For such caches to avoid aliasing, size = sets × line size × associativity, where sets × line size is limited to the page size of the system, which for x86 is fixed by the ISA. Increasing associativity is expensive, and a purely virtual cache would need to be flushed on a context switch, so a bigger cache may not even be desirable. – EOF
Processor design is an exercise in balancing latency. The bigger the cache, the longer the signal pathways, and the slower it needs to be clocked. – Hans Passant
The faster the memory, the more area it needs. For example, typical SRAM uses 6 transistors per bit, but there are also designs with 8 or 10 transistors. OTOH, DRAM needs only 1 transistor per bit, so it takes much less area, but it may be hundreds of times slower than SRAM. – phuclv
@MartinKersten: Given that a latch already takes two NOR gates (not transistors), and a flip-flop is more complicated than a latch (a flip-flop can be constructed from two latches), I have to ask how you propose to construct an SRAM bit from two transistors. I'm sure chipmakers around the world are eager to license your pending patent. – EOF
This should go to Electronics SE. It has nothing to do with programming or assembly language. – Bregalad

2 Answers

3 votes

L1i and L1d need low latency, and L1d needs multiple read/write ports. L1d also needs to support unaligned loads/stores of any width from a single byte up to 32 bytes (or 64 bytes on CPUs with AVX-512). Keeping these caches small is important for maintaining those properties and keeping power in check.

Being small also makes VIPT (Virtually Indexed, Physically Tagged) easier, which is essential to minimize latency. (Fetch tags+data in parallel with the TLB lookup of the high bits of the address.)
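
As a back-of-the-envelope check of that constraint (assuming the standard 4KB x86 page and the long-standing 32KB, 8-way, 64-byte-line L1d; the numbers are illustrative, not taken from this answer):

```c
#include <assert.h>
#include <stdio.h>

/* VIPT works without aliasing as long as the index + line-offset bits
 * fit inside the page offset, i.e. cache size / ways <= page size. */
#define PAGE_SIZE  4096
#define CACHE_SIZE (32 * 1024)
#define WAYS       8
#define LINE       64

int main(void) {
    unsigned sets = CACHE_SIZE / (WAYS * LINE);   /* 64 sets */
    unsigned index_plus_offset = sets * LINE;     /* 4096 bytes */
    assert(index_plus_offset <= PAGE_SIZE);       /* holds: VIPT is safe */
    printf("%u sets; index+offset covers %u bytes (page = %u)\n",
           sets, index_plus_offset, PAGE_SIZE);
    return 0;
}
```

With 32KB and 8 ways, the index plus offset covers exactly one 4KB page, so the set index can be taken from the untranslated low bits of the virtual address.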

See Why is the size of L1 cache smaller than that of the L2 cache in most of the processors? for more details about these factors.

Beyond a certain point, spending your power budget elsewhere (other than L1i/L1d) is more valuable: e.g. on better out-of-order execution, more load/store buffer entries, or on a much larger per-core private L2 that's somewhat fast but doesn't need multiple read/write ports and doesn't need to support unaligned byte accesses. That's the key change that lets L1d stay small while the shared L3 gets huge.


Fun fact: for Ice Lake, Intel finally bumped up L1d cache from 32k to 48k by increasing associativity from 8 to 12 (maintaining VIPT "for free" without aliasing problems).
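
(Worked out: 48KB / (12 ways × 64B lines) = 64 sets, so the index plus line-offset bits still cover exactly 4KB, one page, which is why the extra ways keep VIPT alias-free.)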

This is the first L1 increase for Intel since Pentium-M, which bumped up to 32k + 32k from the 16k + 16k of Pentium III (and from trace cache + 16k L1d in Pentium 4).

Between Pentium-M and Skylake-X, Intel greatly improved bandwidth between L1d and L2, improved unaligned SIMD loads/stores, widened the SIMD load/store data paths from 8 to 64 bytes, and added another cache read port (Haswell and later can do 2 reads and 1 write per cycle).

OTOH, AMD has experimented with different L1 configurations over the years, but for Zen settled on the same good design as Intel: 32k, good associativity, and a per-core private L2 cache backing it up, so L1d misses aren't a disaster and don't have to go out to the shared caches.

See also

1 vote

I'm no expert, but here are my two cents:

L1 is integrated into the core, which means it shares the core's clock, and its size affects the size of the core.

The first is more of a logic problem. You want L1 to be very, very fast, only barely slower than the registers. You can't solve this by clocking L1 up, since the core gets clocked up as well. Hardware caches are similar to software caches: it takes time to search through them. So when L1 gets bigger, the search becomes slower, assuming the sophistication of the hardware cache design stays the same. You can increase the sophistication of the design, but this has a negative effect on space, energy, and heat.

Continuing on size: if you make L1 bigger, you need space to store those bits and bytes, which creates the same space and energy problem.

So you have different design criteria for L1 and L2, and by keeping them separate you divide the problem and conquer it at the two levels. If you make L1 as big and slow as L2, you blur that separation.
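
If you want to see this effect in numbers, a rough pointer-chasing sketch (a common microbenchmark pattern, not something from this answer; the buffer sizes and iteration counts are arbitrary) shows per-load latency stepping up as the working set outgrows each cache level:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Pointer-chase one random cycle through a buffer: every load depends
 * on the previous one, so the loop time approximates the per-access
 * latency for that working-set size. */
static double chase_ns(size_t bytes, size_t iters) {
    size_t n = bytes / sizeof(size_t);
    size_t *buf  = malloc(n * sizeof(size_t));
    size_t *perm = malloc(n * sizeof(size_t));
    for (size_t i = 0; i < n; i++) perm[i] = i;
    for (size_t i = n - 1; i > 0; i--) {        /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)              /* link one full cycle */
        buf[perm[i]] = perm[(i + 1) % n];
    volatile size_t idx = 0;                    /* volatile: keep loads live */
    clock_t t0 = clock();
    for (size_t k = 0; k < iters; k++) idx = buf[idx];
    double ns = (double)(clock() - t0) * 1e9 / CLOCKS_PER_SEC / iters;
    free(perm);
    free(buf);
    return ns;
}

int main(void) {
    for (size_t kib = 4; kib <= 256; kib *= 2)
        printf("%4zu KiB: %.2f ns/load\n", kib, chase_ns(kib * 1024, 20000000));
    return 0;
}
```

On a typical Intel core you would expect the numbers to jump once the working set passes 32 KiB (L1d) and again past the L2 size.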

Readings: