This loop runs at one iteration per 3 cycles on Intel Conroe/Merom, bottlenecked on imul throughput as expected.  But on Haswell/Skylake, it runs at one iteration per 11 cycles, apparently because setnz al has a dependency on the last imul.
; synthetic micro-benchmark to test partial-register renaming
    mov     ecx, 1000000000
.loop:                 ; do{
    imul    eax, eax     ; a dep chain with high latency but also high throughput
    imul    eax, eax
    imul    eax, eax
    dec     ecx          ; set ZF, independent of old ZF.  (Use sub ecx,1 on Silvermont/KNL or P4)
    setnz   al           ; ****** Does this depend on RAX as well as ZF?
    movzx   eax, al
    jnz  .loop         ; }while(ecx);
If setnz al depends on rax, the 3ximul/setcc/movzx sequence forms a loop-carried dependency chain.  If not, each setcc/movzx/3ximul chain is independent, forked off from the dec that updates the loop counter.  The 11c per iteration measured on HSW/SKL is perfectly explained by a latency bottleneck: 3x3c(imul) + 1c(read-modify-write by setcc) + 1c(movzx within the same register).
Off topic: avoiding these (intentional) bottlenecks
I was going for understandable / predictable behaviour to isolate partial-reg stuff, not optimal performance.
For example, xor-zero / set-flags / setcc is better anyway (in this case, xor eax,eax / dec ecx / setnz al).  That breaks the dep on eax on all CPUs (except early P6-family like PII and PIII), still avoids partial-register merging penalties, and saves 1c of movzx latency.  It also uses one fewer ALU uop on CPUs that handle xor-zeroing in the register-rename stage.  See that link for more about using xor-zeroing with setcc.
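As a rough sketch of the xor-zero / set-flags / setcc variant described above, the test loop might look like the following. The Linux x86-64 `_start` / `exit_group` scaffolding and the file name in the build comment are assumptions added only to make it a complete program; they are not part of the original benchmark.

```nasm
; Assumed build (hypothetical file name):
;   nasm -felf64 xorzero.asm && ld -o xorzero xorzero.o
global _start
section .text
_start:
    mov     ecx, 1000000000
.loop:                      ; do{
    imul    eax, eax        ; same high-latency, high-throughput chain as before
    imul    eax, eax
    imul    eax, eax
    xor     eax, eax        ; zeroing idiom; must come before dec because xor writes flags
    dec     ecx             ; set ZF for both setnz and jnz
    setnz   al              ; writes AL of an already-zeroed EAX, so no movzx is needed
    jnz     .loop           ; }while(ecx);

    xor     edi, edi        ; exit status 0
    mov     eax, 231        ; __NR_exit_group on x86-64 Linux (assumed target)
    syscall
```

The ordering is the only subtle part: the xor has to sit before the flag-setting instruction, since it would otherwise clobber the ZF that setnz and jnz read.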
Note that AMD, Intel Silvermont/KNL, and P4 don't do partial-register renaming at all. It's only a feature in Intel P6-family CPUs and their descendant, Intel Sandybridge-family, but it seems to be getting phased out.
gcc unfortunately does tend to use cmp / setcc al / movzx eax,al where it could have used xor instead of movzx (Godbolt compiler-explorer example), while clang uses xor-zero/cmp/setcc unless you combine multiple boolean conditions like count += (a==b) | (a==~b).
The xor/dec/setnz version runs at 3.0c per iteration on Skylake, Haswell, and Core2 (bottlenecked on imul throughput).  xor-zeroing breaks the dependency on the old value of eax on all out-of-order CPUs other than PPro/PII/PIII/early-Pentium-M (where it still avoids partial-register merging penalties but doesn't break the dep).  Agner Fog's microarch guide describes this.  Replacing the xor-zeroing with mov eax,0 slows it down to one per 4.78 cycles on Core2: 2-3c stall (in the front-end?) to insert a partial-reg merging uop when imul reads eax after setnz al.
Also, I used movzx eax, al which defeats mov-elimination, just like mov rax,rax does.  (IvB, HSW, and SKL can rename movzx eax, bl with 0 latency, but Core2 can't).  This makes everything equal across Core2 / SKL, except for the partial-register behaviour.
The Core2 behaviour is consistent with Agner Fog's microarch guide, but the HSW/SKL behaviour isn't. From section 11.10 for Skylake, and same for previous Intel uarches:
"Different parts of a general purpose register can be stored in different temporary registers in order to remove false dependences."
He unfortunately doesn't have time to do detailed testing for every new uarch to re-test assumptions, so this change in behaviour slipped through the cracks.
Agner does describe a merging uop being inserted (without stalling) for high8 registers (AH/BH/CH/DH) on Sandybridge through Skylake, and for low8/low16 on SnB. (I've unfortunately been spreading misinformation in the past, saying that Haswell can merge AH for free. I skimmed Agner's Haswell section too quickly and didn't notice the later paragraph about high8 registers. Let me know if you see my wrong comments on other posts, so I can delete them or add a correction. I will try to at least find and edit my answers where I've said this.)
My actual questions: How exactly do partial registers really behave on Skylake?
Is everything the same from IvyBridge to Skylake, including the high8 extra latency?
Intel's optimization manual is not specific about which CPUs have false dependencies for what (although it does mention that some CPUs have them), and leaves out things like reading AH/BH/CH/DH (high8 registers) adding extra latency even when they haven't been modified.
If there's any P6-family (Core2/Nehalem) behaviour that Agner Fog's microarch guide doesn't describe, that would be interesting too, but I should probably limit the scope of this question to just Skylake or Sandybridge-family.
My Skylake test data, from putting %rep 4 short sequences inside a small dec ebp/jnz loop that runs 100M or 1G iterations.  I measured cycles with Linux perf the same way as in my answer here, on the same hardware (desktop Skylake i7 6700k).
Unless otherwise noted, each instruction runs as 1 fused-domain uop, using an ALU execution port.  (Measured with ocperf.py stat -e ...,uops_issued.any,uops_executed.thread).  This detects (absence of) mov-elimination and extra merging uops.
The "4 per cycle" cases are an extrapolation to the infinitely-unrolled case. Loop overhead takes up some of the front-end bandwidth, but anything better than 1 per cycle is an indication that register-renaming avoided the write-after-write output dependency, and that the uop isn't handled internally as a read-modify-write.
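A minimal sketch of the kind of harness described above, assuming a static Linux x86-64 executable. The `_start` / `exit_group` scaffolding, the iteration count chosen from the stated range, and the file name in the build comment are assumptions; the instruction inside the `%rep` block is just one of the sequences tested and is meant to be swapped out.

```nasm
; Assumed build (hypothetical file name):
;   nasm -felf64 partial.asm && ld -o partial partial.o
global _start
section .text
_start:
    mov     ebp, 100000000  ; 100M iterations (the post uses 100M or 1G)
.loop:
%rep 4                      ; NASM repeats the enclosed lines 4 times
    mov     ah, bh          ; sequence under test: replace with the case being measured
%endrep
    dec     ebp
    jnz     .loop

    xor     edi, edi        ; exit status 0
    mov     eax, 231        ; __NR_exit_group on x86-64 Linux (assumed target)
    syscall
```

Running the resulting binary under perf with the cycle and uop counters mentioned above, then dividing by the iteration count times 4, gives the per-instruction figures; the loop overhead of dec/jnz is why the "4 per cycle" cases are an extrapolation.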
Writing to AH only: this prevents the loop from executing from the loopback buffer (aka the Loop Stream Detector (LSD)). Counts for lsd.uops are exactly 0 on HSW, and tiny on SKL (around 1.8k), and don't scale with the loop iteration count. Probably those counts are from some kernel code. When loops do run from the LSD, lsd.uops ~= uops_issued to within measurement noise. Some loops alternate between LSD and no-LSD (e.g. when they might not fit into the uop cache if decode starts in the wrong place), but I didn't run into that while testing this.
- repeated mov ah, bh and/or mov ah, bl runs at 4 per cycle. It takes an ALU uop, so it's not eliminated like mov eax, ebx is.
- repeated mov ah, [rsi] runs at 2 per cycle (load throughput bottleneck).
- repeated mov ah, 123 runs at 1 per cycle. (A dep-breaking xor eax,eax inside the loop removes the bottleneck.)
- repeated setz ah or setc ah runs at 1 per cycle. (A dep-breaking xor eax,eax lets it bottleneck on p06 throughput for setcc and the loop branch.)
  Why does writing ah with an instruction that would normally use an ALU execution unit have a false dependency on the old value, while mov r8, r/m8 doesn't (for reg or memory src)? (And what about mov r/m8, r8? Surely it doesn't matter which of the two opcodes you use for reg-reg moves?)
- repeated add ah, 123 runs at 1 per cycle, as expected.
- repeated add dh, cl runs at 1 per cycle.
- repeated add dh, dh runs at 1 per cycle.
- repeated add dh, ch runs at 0.5 per cycle. Reading [ABCD]H is special when they're "clean" (in this case, RCX is not recently modified at all).
Terminology: All of these leave AH (or DH) "dirty", i.e. in need of merging (with a merging uop) when the rest of the register is read (or in some other cases).  i.e. that AH is renamed separately from RAX, if I'm understanding this correctly.  "clean" is the opposite.  There are many ways to clean a dirty register, the simplest being inc eax or mov eax, esi.
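To make the "dirty" / "clean" terminology concrete, here is a small illustrative loop, again assuming a Linux x86-64 `_start` / `exit_group` wrapper and a hypothetical file name. The comments restate the behaviour reported in this post for HSW/SKL; they are observations under test, not documented guarantees.

```nasm
; Assumed build (hypothetical file name):
;   nasm -felf64 dirty.asm && ld -o dirty dirty.o
global _start
section .text
_start:
    mov     ebp, 100000000
.loop:
    mov     ah, bl          ; writes AH only: leaves AH "dirty" (in need of merging)
    mov     ecx, eax        ; reads full EAX while AH is dirty: a merging uop is expected
    inc     eax             ; full-register write: one of the simple ways to "clean" it
    dec     ebp
    jnz     .loop

    xor     edi, edi        ; exit status 0
    mov     eax, 231        ; __NR_exit_group on x86-64 Linux (assumed target)
    syscall
```

If the model is right, comparing uops_issued.any against the instruction count for this loop should show the extra merging uop per iteration.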
Writing to AL only: These loops do run from the LSD: uops_issued.any ~= lsd.uops.
- repeated mov al, bl runs at 1 per cycle. An occasional dep-breaking xor eax,eax per group lets OOO execution bottleneck on uop throughput, not latency.
- repeated mov al, [rsi] runs at 1 per cycle, as a micro-fused ALU+load uop. (uops_issued=4G + loop overhead, uops_executed=8G + loop overhead). A dep-breaking xor eax,eax before a group of 4 lets it bottleneck on 2 loads per clock.
- repeated mov al, 123 runs at 1 per cycle.
- repeated mov al, bh runs at 0.5 per cycle. (1 per 2 cycles). Reading [ABCD]H is special.
  - xor eax,eax + 6x mov al,bh + dec ebp/jnz: 2c per iter, bottleneck on 4 uops per clock for the front-end.
- repeated add dl, ch runs at 0.5 per cycle. (1 per 2 cycles). Reading [ABCD]H apparently creates extra latency for dl.
- repeated add dl, cl runs at 1 per cycle.
I think a write to a low-8 reg behaves as a RMW blend into the full reg, like add eax, 123 would be, but it doesn't trigger a merge if ah is dirty.  So (other than ignoring AH merging) it behaves the same as on CPUs that don't do partial-reg renaming at all.  It seems AL is never renamed separately from RAX?
- inc al / inc ah pairs can run in parallel.
- mov ecx, eax inserts a merging uop if ah is "dirty", but the actual mov is renamed. This is what Agner Fog describes for IvyBridge and later.
- repeated movzx eax, ah runs at one per 2 cycles. (Reading high-8 registers after writing full regs has extra latency.)
- movzx ecx, al has zero latency and doesn't take an execution port on HSW and SKL. (Like what Agner Fog describes for IvyBridge, but he says HSW doesn't rename movzx).
- movzx ecx, cl has 1c latency and takes an execution port. (mov-elimination never works for the same,same case, only between different architectural registers.)
- A loop that inserts a merging uop every iteration can't run from the LSD (loop buffer)?
I don't think there's anything special about AL/AH/RAX vs. B*, C*, DL/DH/RDX.  I have tested some with partial regs in other registers (even though I'm mostly showing AL/AH for consistency), and have never noticed any difference.
How can we explain all of these observations with a sensible model of how the microarch works internally?
Related:  Partial flag issues are different from partial register issues.  See INC instruction vs ADD 1: Does it matter? for some super-weird stuff with shr r32,cl (and even shr r32,2 on Core2/Nehalem: don't read flags from a shift other than by 1).
See also Problems with ADC/SBB and INC/DEC in tight loops on some CPUs for partial-flag stuff in adc loops.
mov al, 123 runs at 1 per cycle? but movl eax, 123 repeated runs at 4 cycles / iteration? Nevermind, it's because mov al, 123 is not dependency breaking. - Noah