18
votes

What is a speed of cache accessing for modern CPUs? How many bytes can be read or written from memory every processor clock tick by Intel P4, Core2, Corei7, AMD?

Please, answer with both theoretical (width of ld/sd unit with its throughput in uOPs/tick) and practical numbers (even memcpy speed tests, or STREAM benchmark), if any.

PS it is question, related to maximal rate of load/store instructions in assembler. There can be theoretical rate of loading (all Instructions Per Tick are widest loads), but processor can give only part of such, a practical limit of loading.

2
@osgx: Move to serverfault etc, not a programming question is it?TFD
@TFD, no, this is very programming related.Nikolai Fetissov
Consult "Analyzing Cache Bandwidth on the Intel Core 2 Architecture" by Robert Sch¨one, Wolfgang E. Nagel, and Stefan Pfl¨uger, Center for Information Services and High Performance Computing, Technische Universit¨at Dresden, 01062 Dresden, Germany In this paper, measured bandwidths between the computing cores and the different caches are presented. The STREAM benchmark1 is one of the most used kernels by scientists to determine the memory bandwidth. For deeper insight the STREAM benchmark was redesigned to get exact values for small problem sizes as well.osgx
So do you want to know the answer in "maximal rate of load/store instructions" or "bytes loaded/cycle"? The answer is quite different. Recent CPUs are limited more by instructions (e.g., 2 loads/cycle) than bytes (so a byte load and a 32-byte load have about the same cost), at least in cache levels close to the core. For DRAM it is more about cache lines/cycle: i.e., it doesn't matter if you load an entire line or 1 byte from it, it costs the same.BeeOnRope

2 Answers

10
votes

For nehalem: rolfed.com/nehalem/nehalemPaper.pdf

Each core in the architecture has a 128-bit write port and a
128-bit read port to the L1 cache. 

128 bit = 16 bytes / clock read AND 128 bit = 16 bytes / clock write (can I combine read and write in single cycle?)

The L2 and L3 caches each have a 256-bit port for reading or writing, 
but the L3 cache must share its port with three other cores on the chip.

Can L2 and L3 read and write ports be used in single clock?

Each integrated memory controller has a theoretical bandwidth
peak of 32 Gbps.

Latency (clock ticks), some measured by CPU-Z's latencytool or by lmbench's lat_mem_rd - both uses long linked list walk to correctly measure modern out-of-order cores like Intel Core i7

           L1     L2     L3, cycles;   mem             link
Core 2      3     15     --           66 ns           http://www.anandtech.com/show/2542/5
Core i7-xxx 4     11     39          40c+67ns         http://www.anandtech.com/show/2542/5
Itanium     1     5-6    12-17       130-1000 (cycles)
Itanium2    2     6-10   20          35c+160ns        http://www.7-cpu.com/cpu/Itanium2.html
AMD K8            12                 40-70c +64ns     http://www.anandtech.com/show/2139/3
Intel P4    2     19     43          200-210 (cycles) http://www.arsc.edu/files/arsc/phys693_lectures/Performance_I_Arch.pdf
AthlonXP 3k 3     20                 180 (cycles)     --//--
AthlonFX-51 3     13                 125 (cycles)     --//--
POWER4      4     12-20  ??          hundreds cycles  --//--
Haswell     4     11-12  36          36c+57ns         http://www.realworldtech.com/haswell-cpu/5/    

And good source on latency data is 7cpu web-site, e.g. for Haswell: http://www.7-cpu.com/cpu/Haswell.html

More about lat_mem_rd program is in its man page or here on SO.

7
votes

Widest read/writes are 128 bit (16 byte) SSE load/store. L1/L2/L3 caches have different bandwidths and latencies and these are of course CPU-specific. Typical L1 latency is 2 - 4 clocks on modern CPUs but you can usually issue 1 or 2 load instructions per clock.

I suspect there's a more specific question lurking here somewhere - what is it that you are actually trying to achieve ? Do you just want to write the fastest possible memcpy ?