I am reading http://www.realworldtech.com/sandy-bridge/ and I'm having trouble understanding two passages:

The dedicated stack pointer tracker is also present in Sandy Bridge and renames the stack pointer, eliminating serial dependencies and removing a number of uops.

What is a dedicated stack pointer tracker actually?

For Sandy Bridge (and the P4), Intel still uses the term ROB. But it is critical to understand that, in this context, it only refers to the status array for in-flight uops

What does this actually mean? Please clarify.

2 Answers

18
votes
  1. As Agner Fog's microarch doc explains, the stack engine handles the rsp+=8 / rsp-=8 part of push/pop / call/ret in the issue stage of the pipeline (before issuing uops into the Out-of-Order (OoO) part of the core).

    So the OoO execution part of the core only has to handle the load/store part, with an address generated by the stack engine. The stack engine occasionally has to insert a uop to sync its offset from rsp: when the 8-bit displacement counter overflows, or when the OoO core needs the value of rsp directly. For example, sub rsp, 8 or mov [rsp-8], eax after a call, ret, push or pop typically causes an extra uop to be inserted on Intel CPUs. (AMD CPUs apparently don't need extra sync uops.)

    Note that Agner's instruction tables show that Pentium-M and later decode pop reg to a single uop which runs only on the load port. But Pentium II/III decode pop eax to 2 uops (1 ALU and 1 load), because there's no stack engine to handle the ESP adjustment outside of the out-of-order core. Besides taking extra uops, a long chain of push/pop and call/ret creates a serial dependency on ESP, so out-of-order execution has to chew through the ALU uops before a value is available for a mov ebp, esp, or an address for mov eax, [esp+16].


  2. The P6 microarch family (PPro to Nehalem) stored the input values for a uop directly in the ROB. At issue/rename, "cold" register inputs are read from the architectural register file into the ROB. This can be a bottleneck due to limited read ports (see register-read stalls). After a uop executes, its result is written into the ROB for other uops to read. The architectural register file is updated with values from the ROB when uops retire.

    SnB-family microarchitectures (and P4) have a physical register file, so the ROB stores register numbers (i.e. a level of indirection) instead of the data directly. Re-Order Buffer is still an excellent name for that part of the CPU.

Note that SnB introduced AVX, with 256-bit vectors. Making every ROB entry big enough to store double-size vectors was presumably undesirable compared to only keeping them in a smaller FP register file.
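To make the "level of indirection" concrete, here is a minimal Python sketch of a physical-register-file design. It is a toy model under stated assumptions, not Intel's actual implementation: the names RobEntry and ToyCore, the free-list policy, and the "free the old mapping at retirement" rule are all hypothetical, textbook-style choices. The point is only that a ROB entry holds status and register numbers, while the data lives in the register file.

```python
from dataclasses import dataclass


@dataclass
class RobEntry:
    """One in-flight uop: status and register numbers only, no data."""
    arch_dest: str      # architectural register the uop writes
    new_phys: int       # physical register allocated for the result
    old_phys: int       # previous mapping, recycled when the uop retires
    done: bool = False  # status bit: has the uop executed yet?


class ToyCore:
    """Hypothetical rename/retire model with a physical register file."""

    def __init__(self, arch_regs, n_phys):
        self.prf = [0] * n_phys                             # the only place values live
        self.rat = {r: i for i, r in enumerate(arch_regs)}  # arch -> phys mapping
        self.free = list(range(len(arch_regs), n_phys))     # unused physical regs
        self.rob = []                                       # program-order queue

    def issue(self, dest):
        # Real hardware would stall when the free list is empty;
        # this sketch simply raises IndexError in that case.
        new = self.free.pop(0)
        entry = RobEntry(dest, new, self.rat[dest])
        self.rat[dest] = new
        self.rob.append(entry)
        return entry

    def execute(self, entry, value):
        # The result goes to the register file, not into the ROB entry.
        self.prf[entry.new_phys] = value
        entry.done = True

    def retire(self):
        # Retire in program order: stop at the first unfinished uop.
        retired = 0
        while self.rob and self.rob[0].done:
            entry = self.rob.pop(0)
            self.free.append(entry.old_phys)
            retired += 1
        return retired


core = ToyCore(["rax", "rbx"], n_phys=4)
older = core.issue("rax")
younger = core.issue("rax")
core.execute(younger, 7)          # finishes out of order
stalled = core.retire()           # nothing retires: the older uop is not done
core.execute(older, 5)
retired = core.retire()           # both uops now retire, in order
print(stalled, retired, core.prf[core.rat["rax"]])
```

A P6-style design would instead give RobEntry a value field wide enough for the largest register, which is exactly the cost that grows with 256-bit vectors.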

SnB simplified the uop format to save power. This did lead to a sacrifice in uop micro-fusion capability, though: the decoders and uop cache can still micro-fuse memory operands using 2-register (indexed) addressing modes, but they're "unlaminated" before issuing into the OoO core.

0
votes

The stack engine is kind of like another execution/memory port. As Fog says:

The modification of the stack pointer by PUSH, POP, CALL and RET instructions is done by a special stack engine. ... This relieves the pipeline from the burden of μops that modify the stack pointer.

So that takes care of the rsp+=8 / rsp-=8 arithmetic. It gets handled by the stack engine without competing for execution port resources. But there's more.
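As a rough illustration of that bookkeeping, here is a toy Python model of a stack engine. It is a sketch under assumptions, not a description of the real hardware: the class name StackEngine, the signed 8-bit offset range, and the uop strings are made up for the example, and each push/pop is shown as a single memory uop for simplicity. The front end keeps a small delta relative to the rsp value the out-of-order core knows about, and only emits a sync uop when that delta would overflow or when an instruction uses rsp explicitly.

```python
class StackEngine:
    """Toy front-end model: tracks rsp as (core's rsp) + small delta."""

    def __init__(self, offset_bits=8):
        self.delta = 0                        # pending adjustment not yet seen by the core
        self.limit = 1 << (offset_bits - 1)   # assumed signed range [-limit, limit)
        self.uops = []                        # uops sent to the out-of-order core

    def _sync(self):
        # Fold the pending delta into the core's rsp with one extra uop.
        if self.delta != 0:
            self.uops.append(f"sync: rsp += {self.delta}")
            self.delta = 0

    def _adjust(self, amount):
        # Sync first if the small offset counter would overflow.
        if not (-self.limit <= self.delta + amount < self.limit):
            self._sync()
        self.delta += amount

    def push(self, reg):
        self._adjust(-8)
        self.uops.append(f"store {reg} -> [rsp{self.delta:+d}]")

    def pop(self, reg):
        self.uops.append(f"load {reg} <- [rsp{self.delta:+d}]")
        self._adjust(+8)

    def explicit_rsp_use(self, text):
        # The core needs the true rsp value, so the delta must be applied first.
        self._sync()
        self.uops.append(text)


se = StackEngine()
se.push("rbx")
se.push("rbp")
se.explicit_rsp_use("mov rax, [rsp+16]")   # forces one sync uop
se.pop("rbp")
se.pop("rbx")
for uop in se.uops:
    print(uop)
```

Note that no uop in the list adjusts rsp for an individual push or pop; the only rsp arithmetic the core sees is the single sync uop before the explicit use.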

The 16-deep hardware return address stack (Section 3.4.1.4 of the Intel® 64 and IA-32 Architectures Optimization Reference Manual) is a fast shadow of the return addresses. It showed up in Pentium M. It is also used for return prediction. Searching Fog's Microarchitecture doc for "return stack buffer" turns up a little more detail, but not much.

So now you have some nice HW to reduce execution port contention for stack arithmetic, plus a fast cache of return address values. You can make the stack engine's life difficult by trying to outsmart it. Basically, always match calls with rets and pushes with pops. Then you're good to go.
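Why matching matters can be shown with a small Python sketch of a return-address predictor. This is a simplified model, not the real hardware: the class name ReturnStackBuffer and the "drop the oldest entry on overflow" policy are assumptions for illustration, and only the depth of 16 comes from the text above. A call pushes its return address, a ret pops a prediction, and the prediction is only right when calls and rets pair up.

```python
class ReturnStackBuffer:
    """Toy return predictor: a small hardware-style stack of return addresses."""

    def __init__(self, depth=16):
        self.depth = depth
        self.entries = []

    def on_call(self, return_address):
        # Assumed overflow policy: the oldest prediction is lost.
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def on_ret(self, actual_target):
        # Returns True when the predicted target matches the real one.
        predicted = self.entries.pop() if self.entries else None
        return predicted == actual_target


# Matched call/ret pairs predict correctly.
rsb = ReturnStackBuffer()
rsb.on_call(0x1000)
rsb.on_call(0x2000)
matched = [rsb.on_ret(0x2000), rsb.on_ret(0x1000)]

# A ret that does not pair with the last call (for example a push/ret
# used as a jump) gets a stale prediction.
rsb.on_call(0x1000)
mismatched = rsb.on_ret(0x5000)

# Nesting deeper than the buffer loses the outermost return address.
deep = ReturnStackBuffer()
for addr in range(17):
    deep.on_call(addr)
deep_results = [deep.on_ret(addr) for addr in reversed(range(17))]

print(matched, mismatched, deep_results.count(True))
```

In this model the first 16 returns of the deep chain are predicted and only the outermost one is missed, which is why ordinary, well-nested code rarely has to think about the return stack at all.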