This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on `imul` throughput as expected. But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because `setnz al` has a dependency on the last `imul`.
```nasm
; synthetic micro-benchmark to test partial-register renaming
    mov     ecx, 1000000000
.loop:                 ; do {
    imul    eax, eax   ; a dep chain with high latency but also high throughput
    imul    eax, eax
    imul    eax, eax
    dec     ecx        ; set ZF, independent of old ZF.  (Use sub ecx,1 on Silvermont/KNL or P4)
    setnz   al         ; ****** Does this depend on RAX as well as ZF?
    movzx   eax, al
    jnz     .loop      ; }while(ecx);
```
If `setnz al` depends on `rax`, the 3x`imul`/`setcc`/`movzx` sequence forms a loop-carried dependency chain. If not, each `setcc`/`movzx`/3x`imul` chain is independent, forked off from the `dec` that updates the loop counter. The 11c per iteration measured on HSW/SKL is perfectly explained by a latency bottleneck: 3x3c (imul) + 1c (read-modify-write by setcc) + 1c (movzx within the same register).
Off topic: avoiding these (intentional) bottlenecks
I was going for understandable / predictable behaviour to isolate partial-reg stuff, not optimal performance.
For example, `xor`-zero / set-flags / `setcc` is better anyway (in this case, `xor eax,eax` / `dec ecx` / `setnz al`). That breaks the dep on `eax` on all CPUs (except early P6-family like PII and PIII), still avoids partial-register merging penalties, and saves 1c of `movzx` latency. It also uses one fewer ALU uop on CPUs that handle xor-zeroing in the register-rename stage. See that link for more about using xor-zeroing with `setcc`.
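A runnable sketch of that version might look like this (assuming NASM syntax and Linux x86-64; file name, event list, and iteration count are just illustrative):

```nasm
; Sketch of the xor-zero / dec / setnz variant described above.
; build/run (assumed): nasm -felf64 setcc-xor.asm && ld -o setcc-xor setcc-xor.o
;                      perf stat ./setcc-xor
global _start
_start:
    mov     ecx, 1000000000
.loop:                         ; do {
    imul    eax, eax           ;   3c-latency dep chain, as in the original loop
    imul    eax, eax
    imul    eax, eax
    xor     eax, eax           ;   dep-breaking zeroing idiom; no ALU uop on CPUs that handle it at rename
    dec     ecx                ;   sets ZF from the loop counter
    setnz   al                 ;   writes AL into the freshly-zeroed RAX; no movzx needed
    jnz     .loop              ; } while (ecx)
    mov     eax, 60            ; SYS_exit(0)
    xor     edi, edi
    syscall
```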
Note that AMD, Intel Silvermont/KNL, and P4 don't do partial-register renaming at all. It's only a feature of Intel P6-family CPUs and their descendant, Intel Sandybridge-family, but it seems to be getting phased out.
gcc unfortunately does tend to use `cmp` / `setcc al` / `movzx eax,al` where it could have used `xor` instead of `movzx` (Godbolt compiler-explorer example), while clang uses xor-zero/cmp/setcc unless you combine multiple boolean conditions like `count += (a==b) | (a==~b)`.
The xor/dec/setnz version runs at 3.0c per iteration on Skylake, Haswell, and Core2 (bottlenecked on `imul` throughput). `xor`-zeroing breaks the dependency on the old value of `eax` on all out-of-order CPUs other than PPro/PII/PIII/early-Pentium-M (where it still avoids partial-register merging penalties but doesn't break the dep). Agner Fog's microarch guide describes this. Replacing the xor-zeroing with `mov eax,0` slows it down to one per 4.78 cycles on Core2: 2-3c stall (in the front-end?) to insert a partial-reg merging uop when `imul` reads `eax` after `setnz al`.
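For reference, the slower Core2 variant is just the same frame as the sketch above with the zeroing swapped out (again a sketch, assuming NASM / Linux x86-64):

```nasm
; Same loop as the xor-zero sketch, but mov eax,0 instead of xor-zeroing.
; mov eax,0 still breaks the data dependency (it has no input operands), but
; it isn't recognized as a zeroing idiom, so on Core2 the later imul (a full-
; register read after the setnz write to AL) needs a partial-register merge.
global _start
_start:
    mov     ecx, 1000000000
.loop:
    imul    eax, eax
    imul    eax, eax
    imul    eax, eax
    mov     eax, 0             ; not a recognized zeroing idiom
    dec     ecx
    setnz   al
    jnz     .loop
    mov     eax, 60            ; SYS_exit(0)
    xor     edi, edi
    syscall
```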
Also, I used `movzx eax, al` which defeats mov-elimination, just like `mov rax,rax` does. (IvB, HSW, and SKL can rename `movzx eax, bl` with 0 latency, but Core2 can't). This makes everything equal across Core2 / SKL, except for the partial-register behaviour.
The Core2 behaviour is consistent with Agner Fog's microarch guide, but the HSW/SKL behaviour isn't. From section 11.10 for Skylake (and the same for previous Intel uarches):

> Different parts of a general purpose register can be stored in different temporary registers in order to remove false dependences.
He unfortunately doesn't have time to do detailed testing for every new uarch to re-test assumptions, so this change in behaviour slipped through the cracks.
Agner does describe a merging uop being inserted (without stalling) for high8 registers (AH/BH/CH/DH) on Sandybridge through Skylake, and for low8/low16 on SnB. (I've unfortunately been spreading mis-information in the past, and saying that Haswell can merge AH for free. I skimmed Agner's Haswell section too quickly, and didn't notice the later paragraph about high8 registers. Let me know if you see my wrong comments on other posts, so I can delete them or add a correction. I will try to at least find and edit my answers where I've said this.)
My actual questions: How exactly do partial registers really behave on Skylake?
Is everything the same from IvyBridge to Skylake, including the high8 extra latency?
Intel's optimization manual is not specific about which CPUs have false dependencies for what (although it does mention that some CPUs have them), and leaves out things like reading AH/BH/CH/DH (high8 registers) adding extra latency even when they haven't been modified.
If there's any P6-family (Core2/Nehalem) behaviour that Agner Fog's microarch guide doesn't describe, that would be interesting too, but I should probably limit the scope of this question to just Skylake or Sandybridge-family.
My Skylake test data, from putting `%rep 4` short sequences inside a small `dec ebp/jnz` loop that runs 100M or 1G iterations. I measured cycles with Linux `perf` the same way as in my answer here, on the same hardware (desktop Skylake i7 6700k).
Unless otherwise noted, each instruction runs as 1 fused-domain uop, using an ALU execution port. (Measured with `ocperf.py stat -e ...,uops_issued.any,uops_executed.thread`). This detects (absence of) mov-elimination and extra merging uops.
The "4 per cycle" cases are an extrapolation to the infinitely-unrolled case. Loop overhead takes up some of the front-end bandwidth, but anything better than 1 per cycle is an indication that register-renaming avoided the write-after-write output dependency, and that the uop isn't handled internally as a read-modify-write.
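Concretely, each test is shaped roughly like this (a sketch assuming NASM and Linux x86-64; the sequence under test here is the `mov ah, bh` case from the next section, and the counts/file names are placeholders):

```nasm
; Shape of the test loops: %rep 4 copies of the sequence under test inside a
; small dec/jnz loop, measured with something like:
;   ocperf.py stat -e ...,uops_issued.any,uops_executed.thread ./a.out
; With only 4 copies per iteration, loop overhead keeps the measured rate
; below the extrapolated 4-per-cycle figure quoted for larger unrolls.
global _start
_start:
    mov     ebp, 100000000     ; 100M iterations
.loop:
%rep 4
    mov     ah, bh             ; sequence under test (placeholder example)
%endrep
    dec     ebp
    jnz     .loop
    mov     eax, 60            ; SYS_exit(0)
    xor     edi, edi
    syscall
```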
Writing to AH only: prevents the loop from executing from the loopback buffer (aka the Loop Stream Detector (LSD)). Counts for `lsd.uops` are exactly 0 on HSW, and tiny on SKL (around 1.8k) and don't scale with the loop iteration count. Probably those counts are from some kernel code. When loops do run from the LSD, `lsd.uops` ~= `uops_issued` to within measurement noise. Some loops alternate between LSD and no-LSD (e.g. when they might not fit into the uop cache if decode starts in the wrong place), but I didn't run into that while testing this.
- repeated `mov ah, bh` and/or `mov ah, bl` runs at 4 per cycle. It takes an ALU uop, so it's not eliminated like `mov eax, ebx` is.
- repeated `mov ah, [rsi]` runs at 2 per cycle (load throughput bottleneck).
- repeated `mov ah, 123` runs at 1 per cycle. (A dep-breaking `xor eax,eax` inside the loop removes the bottleneck.)
- repeated `setz ah` or `setc ah` runs at 1 per cycle. (A dep-breaking `xor eax,eax` lets it bottleneck on p06 throughput for `setcc` and the loop branch; this test is sketched below, after the list.)

Why does writing `ah` with an instruction that would normally use an ALU execution unit have a false dependency on the old value, while `mov r8, r/m8` doesn't (for reg or memory src)? (And what about `mov r/m8, r8`? Surely it doesn't matter which of the two opcodes you use for reg-reg moves?)

- repeated `add ah, 123` runs at 1 per cycle, as expected.
- repeated `add dh, cl` runs at 1 per cycle.
- repeated `add dh, dh` runs at 1 per cycle.
- repeated `add dh, ch` runs at 0.5 per cycle. Reading [ABCD]H is special when they're "clean" (in this case, RCX is not recently modified at all).
Terminology: All of these leave AH (or DH) "dirty", i.e. in need of merging (with a merging uop) when the rest of the register is read (or in some other cases). That is, AH is renamed separately from RAX, if I'm understanding this correctly. "Clean" is the opposite. There are many ways to clean a dirty register, the simplest being `inc eax` or `mov eax, esi`.
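As a concrete example, a sketch of the `setz ah` test from the list above (assuming NASM / Linux x86-64, same shape as the harness sketch earlier):

```nasm
; Does setz ah falsely depend on the old value of RAX/AH?
; With the dep-breaking xor once per iteration, each iteration's AH chain is
; independent, so OOO exec overlaps iterations and the bottleneck becomes p06
; throughput (4 setcc uops + the taken loop branch) instead of a serial chain.
global _start
_start:
    mov     ebp, 100000000
.loop:
    xor     eax, eax           ; dep-breaking: remove it to expose the false dependency
%rep 4
    setz    ah
%endrep
    dec     ebp
    jnz     .loop
    mov     eax, 60            ; SYS_exit(0)
    xor     edi, edi
    syscall
```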
Writing to AL only: These loops do run from the LSD: `uops_issued.any` ~= `lsd.uops`.
- repeated `mov al, bl` runs at 1 per cycle. An occasional dep-breaking `xor eax,eax` per group lets OOO execution bottleneck on uop throughput, not latency.
- repeated `mov al, [rsi]` runs at 1 per cycle, as a micro-fused ALU+load uop. (uops_issued=4G + loop overhead, uops_executed=8G + loop overhead). A dep-breaking `xor eax,eax` before a group of 4 lets it bottleneck on 2 loads per clock.
- repeated `mov al, 123` runs at 1 per cycle.
- repeated `mov al, bh` runs at 0.5 per cycle (1 per 2 cycles). Reading [ABCD]H is special. `xor eax,eax` + 6x `mov al,bh` + `dec ebp/jnz`: 2c per iter, bottleneck on 4 uops per clock for the front-end.
- repeated `add dl, ch` runs at 0.5 per cycle (1 per 2 cycles). Reading [ABCD]H apparently creates extra latency for `dl`.
- repeated `add dl, cl` runs at 1 per cycle.
I think a write to a low-8 reg behaves as a RMW blend into the full reg, like `add eax, 123` would be, but it doesn't trigger a merge if `ah` is dirty. So (other than ignoring `AH` merging) it behaves the same as on CPUs that don't do partial-reg renaming at all. It seems `AL` is never renamed separately from `RAX`?
- `inc al` / `inc ah` pairs can run in parallel (a sketch of this test appears below the list).
- `mov ecx, eax` inserts a merging uop if `ah` is "dirty", but the actual `mov` is renamed. This is what Agner Fog describes for IvyBridge and later.
- repeated `movzx eax, ah` runs at one per 2 cycles. (Reading high-8 registers after writing full regs has extra latency.)
- `movzx ecx, al` has zero latency and doesn't take an execution port on HSW and SKL. (Like what Agner Fog describes for IvyBridge, but he says HSW doesn't rename `movzx`.)
- `movzx ecx, cl` has 1c latency and takes an execution port. (mov-elimination never works for the `same,same` case, only between different architectural registers.)

A loop that inserts a merging uop every iteration can't run from the LSD (loop buffer)?
I don't think there's anything special about AL/AH/RAX vs. B*, C*, DL/DH/RDX. I have tested some with partial regs in other registers (even though I'm mostly showing AL/AH for consistency), and have never noticed any difference.
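A sketch of the `inc al` / `inc ah` pairs test mentioned above (same assumptions as the earlier sketches: NASM, Linux x86-64, placeholder counts):

```nasm
; If AH renames separately from AL/RAX, the two inc chains are independent and
; overlap (about 4 cycles per iteration, bound by the loop-carried AL chain).
; If they serialized into one chain, it would take ~8 cycles per iteration
; (8 dependent 1c incs).
global _start
_start:
    mov     ebp, 100000000
.loop:
%rep 4
    inc     al
    inc     ah
%endrep
    dec     ebp
    jnz     .loop
    mov     eax, 60            ; SYS_exit(0)
    xor     edi, edi
    syscall
```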
How can we explain all of these observations with a sensible model of how the microarch works internally?
Related: Partial flag issues are different from partial register issues. See INC instruction vs ADD 1: Does it matter? for some super-weird stuff with `shr r32,cl` (and even `shr r32,2` on Core2/Nehalem: don't read flags from a shift other than by 1).
See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs for partial-flag stuff in `adc` loops.
`mov al, 123` runs at 1 per cycle? But `mov eax, 123` repeated runs at 4 cycles / iteration? Never mind, it's because `mov al, 123` is not dependency-breaking. – Noah