L1i and L1d need low latency, and (for L1d) multiple read/write ports. L1d also needs to support unaligned load/store of any width from a single byte up to 32 bytes (or 64 bytes on CPUs with AVX-512). Keeping these caches small is important for maintaining those properties and for keeping power in check.
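As a rough illustration of what that flexibility means in practice (assuming AVX, e.g. compile with -mavx; the function and pointer offsets are just made up for the example), here's the kind of access mix the same L1d has to service, where the 32-byte load may even split across a cache-line boundary depending on the pointer:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: all of these widths/alignments hit the same L1d. */
uint64_t mixed_width_loads(const char *p)
{
    uint8_t  b;  memcpy(&b, p + 1, 1);               /* 1-byte load               */
    uint32_t w;  memcpy(&w, p + 3, 4);               /* unaligned 4-byte load     */
    __m256i  v = _mm256_loadu_si256((const __m256i *)(p + 5)); /* unaligned 32-byte load */
    return b + w + (uint64_t)(uint32_t)_mm_cvtsi128_si32(_mm256_castsi256_si128(v));
}
```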
Being small also makes VIPT (Virtually Indexed, Physically Tagged) easier, which is essential for minimizing latency: the tags and data can be fetched in parallel with the TLB lookup of the high bits of the address.
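A minimal sketch of that arithmetic, assuming a Skylake-like L1d geometry (32 KiB, 8-way, 64-byte lines, 4 KiB pages): the set index comes entirely from address bits below the page offset, so it's identical in the virtual and physical address and the tag/data read can start before the TLB answers.

```c
#include <stdint.h>

/* Assumed Skylake-like L1d geometry: 32 KiB, 8-way, 64 B lines, 4 KiB pages. */
enum { LINE = 64, WAYS = 8, SIZE = 32 * 1024, PAGE = 4096 };
enum { SETS = SIZE / (LINE * WAYS) };   /* 64 sets */

/* VIPT works "for free" only if the indexed bytes per way fit in one page. */
_Static_assert(SETS * LINE <= PAGE, "index+offset bits must fit in the page offset");

/* Set index uses only bits [11:6], all below the 4 KiB page offset,
   so virtual index == physical index. */
static unsigned l1d_set_index(uint64_t vaddr)
{
    return (vaddr / LINE) % SETS;       /* == (vaddr >> 6) & 63 */
}
```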
See "Why is the size of L1 cache smaller than that of the L2 cache in most of the processors?" for more details about these factors.
Beyond a certain point, spending your power budget elsewhere (other than L1i / L1d) is more valuable: e.g. on better OoO exec, on more load/store buffer entries, or on a much larger per-core private L2 that's somewhat fast but doesn't need multiple read/write ports and doesn't need to support unaligned byte accesses. That per-core private L2 is the key change that lets L1d stay small while the shared L3 gets huge.
Fun fact: for Ice Lake, Intel finally bumped up L1d cache from 32k to 48k by increasing associativity from 8 to 12 (maintaining VIPT "for free" without aliasing problems).
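The arithmetic, assuming the usual 64-byte lines and 4 KiB x86 pages: 32 KiB / 8 ways = 4 KiB per way, and 48 KiB / 12 ways is still 4 KiB per way, so the set index still comes entirely from bits below the page offset and the VIPT lookup doesn't need any aliasing workarounds.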
This is Intel's first L1 size increase since Pentium M, which bumped up to 32k + 32k from Pentium III's 16k + 16k (and from Pentium 4's trace cache + 16k L1d).
Between Pentium M and Skylake-X, Intel greatly improved bandwidth between L1d and L2, improved unaligned SIMD load/store, widened the SIMD load/store data paths from 8 bytes to 64, and added another cache read port (Haswell and later can do 2 reads and 1 write per cycle).
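For a feel of what those ports buy you, a plain AVX2-style add loop is a sketch of code that needs exactly 2 loads + 1 store per vector of results; assuming L1d-resident data and no other bottleneck, Haswell and later can in principle sustain roughly one such iteration per clock:

```c
#include <immintrin.h>
#include <stddef.h>

/* Each vector iteration does 2 loads + 1 store of 32 bytes each. */
void add_arrays(float *c, const float *a, const float *b, size_t n)
{
    size_t i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; i++)                  /* scalar tail */
        c[i] = a[i] + b[i];
}
```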
OTOH, AMD has experimented with different L1 configurations over the years, but for Zen has settled on the same good design as Intel. (32k, good associativity, per-core private L2 cache backing it up so L1d misses aren't a disaster and don't have to hit shared caches.)
See also this comment: a VIPT cache's size is limited to sets * associativity, where the sets (i.e. the bytes indexed per way) usually can't exceed the page size of the system, which for x86 is part of the ISA, AFAIK. Increasing associativity is expensive, and virtually tagged caches need to be flushed on context-switch, so a bigger cache may not even be desirable. – EOF