3
votes

I'm trying to learn assembly, and the book I'm reading shows tables of a CPU's functional units and their latencies.

I was wondering: what are the functional units of my CPU, and what are their latencies for integer addition, integer multiplication, single-precision addition, single-precision multiplication, and double-precision multiplication?

My CPU is an AMD Ryzen 5 3600.

I've checked out these links: https://www.amd.com/en/technologies/zen-core-3 https://en.wikichip.org/wiki/amd/microarchitectures/zen_3

but couldn't find anything about functional units in my processor or its latencies.

example of a latency table from the book:

[image: latency table from the book]

example of functional-unit information for the Intel Core i7 (Haswell):

[image: functional units in the Intel Core i7 Haswell]

Any help is appreciated, thank you!! :)

1
I think this is all going to be in the CPU manuals that Intel and AMD put out, if they choose to include this information. This is not exactly what you're looking for because it's not specific to one single CPU, but it is the most detailed document about Intel's x86 architecture. It might have information you'd find useful. – wxz
In some cases this information is proprietary, but third-party reverse engineering is able to produce a good guess. Agner Fog's optimization manuals are one widely used source; see volumes 3 and 4. It's also sort of implicit in the data generated at uops.info, where the "Ports" column infers which functional units exist and which instructions use them. – Nate Eldredge
(I was working on some edits when you accepted the answer; have a look if you didn't see the last edit; I'm done updating it for the moment.) – Peter Cordes
@NateEldredge yep, Agner's is a pretty good reference as well, thanks! – Megan Darcy

1 Answer

5
votes

Zen 3 is only an incremental change from Zen 2, so Wikichip didn't repeat the architecture details section. See https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram.

For latencies, on https://uops.info/ you can see which ALU instructions are single-uop, and what the measured latencies are. Unless there's inherent bypass-latency as part of the only way to use an instruction (e.g. possibly in pmovmskb), those are the same latencies as the underlying functional units.


For earlier microarchitectures, including Intel from Core 2 through Haswell and AMD K8 / K10 / Bulldozer, David Kanter wrote up some really nice micro-architectural deep dives.

Reading these, especially the Sandy Bridge and Haswell articles, will be helpful in understanding Zen (because there are similarities). But note that Zen can decode even a memory-destination add [rdi], eax as a single front-end uop, unlike Intel, where the required load and store operations are separate uops that have to be micro-fused to squeeze through the front-end without taking extra bandwidth.

But Bulldozer is like Zen in having separate scheduler queues for integer vs. FP execution units. Unlike Intel, they don't share "ports" between integer and FP, so those operations don't compete with each other in the back-end.


Your book says Haswell has 8 "functional units"

That's not quite true. Intel CPUs group execution units onto different ports, but that doesn't mean all the execution units connected to / through one port are physically part of one big "functional unit" or execution unit.

For example, Andy Glew (one of the architects of Intel's P6 microarchitecture) commented on What is the "EU" in x86 architecture? (calculates effective address?), saying "I did not get into the complexity of groups of specialized EUs sharing start ports and completion ports, let alone RF read and write ports, flexible latencies, etc. it was hard enough to explain those issues in the Intel compiler writer's guide, when I wrote the first version for P6 circa 1994."

Kanter's diagrams for SnB and HSW on https://www.realworldtech.com/haswell-cpu/4/ show this:

[image: Haswell execution-engine block diagram from Real World Tech]

For example: port 1 has three separate (groups of) execution units connected through it:

  • Integer ALU (including support for 3-cycle latency operations like imul and popcnt, unlike the integer ALU on any other port)
  • SIMD FP ALU, with FMA/MULPS/PD (5-cycle latency, fully pipelined) and separately ADDPS (3-cycle latency). They compete for write-back / completion ports as well, so the scheduler will try to avoid starting an ADDPS on port 1 two cycles after a MULPS.
  • SIMD integer ALU including blend, VPADDB, etc.

These three execution units (or groups of execution units?) are in separate forwarding domains (hence bypass latency if you do a SIMD-integer shift on the output of a SIMD-FP mulps, e.g. to extract the exponent field). It's likely that the FP ALUs are physically close to the FP register file, separate from the integer units. Having separate "domains" also keeps a handle on the combinatorial explosion of what might need to forward to what, and also simplifies the fan-out for signals. (If lots of things need to read the same bus, it takes a stronger signal to drive its voltage to logic-1 or logic-0 with all the capacitive load.)

Skylake dropped the separate SIMD-FP-add ALU, and just runs it on the FMA hardware with the same latency as fma/mul. It's likely that SIMD-FP add was truly a separate execution unit in Haswell, not just a different configuration of the FMA unit, otherwise you'd expect that they would have done that with the FMA units on both port 0 and 1. But addps only has 1/clock throughput on Haswell. (Related: Why does Intel's Haswell chip allow floating point multiplication to be twice as fast as addition?)

I don't know if the integer ALU on port 1 competes with the SIMD stuff for write-back. Possibly not, since integer and FP have different register files. They do need to mark the uop as being done executing in the ROB (ReOrder Buffer), though, and the ROB is unified. (The uop can leave the RS (scheduler) soon after dispatch to an execution port, though; that doesn't need to wait for completion, only to know that its data really was ready as expected so it doesn't need to get replayed. That can happen if it was reading the result of a load, and the load turned out not to hit in cache so wasn't ready with the expected latency.)


Fortunately, that port vs. EU distinction is mostly just "fun fact"

For performance, you really just need to know the numbers from https://uops.info/, and which uops compete for execution ports / units with each other. Not whether addps and fma...ps actually use the same transistors. (And https://agner.org/optimize/ and vendor optimization manuals to understand the details of the pipeline feeding work to those execution units, and bypass latencies between them.)

However, it's certainly interesting to know how CPUs work. And it's occasionally relevant to understanding how different models of the same CPU family differ:

Skylake-X (supporting AVX-512) has an interesting effect: when 512-bit uops are in flight, it shuts down the SIMD ALUs on port 1 and combines the port-0 and port-1 FMA units into one 512-bit FMA unit that handles uops from port 0.

But it doesn't shut down the integer ALUs: that's the only place popcnt / imul / lzcnt / slow-LEA can execute, and it can still run 1-cycle simple integer stuff too. This is a really clear-cut example of execution units being separate from ports, merely reached through them.

(Many Skylake-AVX512 CPUs have a second 512-bit FMA unit connected to port 5 they can power up for 512-bit uops. Some Xeon Bronze / Silver don't. Ice Lake laptop and Rocket Lake chips don't; 512-bit FP add/mul/FMA has 1/clock throughput instead of 1 per 0.5 clocks. https://www.extremetech.com/computing/263963-intel-reverses-declares-skylake-x-cpus-two-avx-512-units has a short article about Skylake-X high-end desktop chips, describing the mechanism.)

Agner Fog also covered that port 1 stuff, and the fact that there are only two vector ALU ports active when any 512-bit uops are in flight, in his microarchitecture guide.