Use movzx to load narrow data on modern CPUs. (Or movsx if it's useful to have it sign-extended instead of zero-extended, but movzx is sometimes faster and never slower.)
movzx is only slow on the ancient P5 (original Pentium) microarchitecture, not anything made this century. Pentium-branded CPUs based on recent microarchitectures, like Pentium G3258 (Haswell, 20th anniversary edition of original Pentium), are totally different beasts, and perform like the equivalent i3 but without AVX, BMI1/2, or hyperthreading.
Don't tune modern code based on P5 guidelines / numbers. However, Knight's Corner (Xeon Phi) is based on a modified P54C microarchitecture, so perhaps it has slow movzx as well. Neither Agner Fog nor Instlatx64 has per-instruction throughput / latency numbers for KNC.
Using a 16-bit operand size instruction doesn't switch the whole pipeline over to 16-bit mode or cause a big perf hit. See Agner Fog's microarch pdf to learn exactly what is and isn't slow on various x86 CPU microarchitectures (including ones as old as Intel P5 (original Pentium) which you seem to be talking about for some reason).
Writing a 16-bit register and then reading the full 32/64-bit register is slow on some CPUs (partial-register stall when merging on Intel P6-family). On others, writing a 16-bit register merges into the old value, so there's a false dependency on the old value of the full register when you write, even if you never read the full register. See which CPU does what. (Note that Haswell/Skylake only rename AH separately, unlike Sandybridge which (like Core2/Nehalem) also renames AL / AX separately from RAX, but merges without stalling.)
Unless you specifically care about in-order P5 (or possibly Knight's Corner Xeon Phi, based on the same core, but IDK if movzx is slow there, too), USE THIS:
movzx eax, word [src1] ; as efficient as a 32-bit MOV load on most CPUs
cmp ax, word [src2]
The operand-size prefix for cmp decodes efficiently on all modern CPUs. Reading a 16-bit register after writing the full register is always fine, and the 16-bit load for the other operand is also fine.
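To show that pattern in context, here is a minimal sketch of a complete program. It assumes x86-64 Linux and NASM syntax; the labels src1 and src2, their values, and the build commands are made up for illustration and are not from the original test setup.

```nasm
; cmp16.asm - sketch only; src1/src2 and their contents are hypothetical.
; Assumed build: nasm -felf64 cmp16.asm && ld -o cmp16 cmp16.o
default rel

section .rodata
src1:   dw 0x1234
src2:   dw 0x1234

section .text
global _start
_start:
    xor     edi, edi            ; zero EDI first so setcc below has no partial-register issue
    movzx   eax, word [src1]    ; zero-extending load writes the whole register
    cmp     ax, word [src2]     ; 16-bit compare: 66 prefix but no immediate, so no LCP stall
    setne   dil                 ; EDI = 0 if the words are equal, 1 otherwise
    mov     eax, 60             ; __NR_exit on x86-64 Linux
    syscall
```

With the two equal values shown, the process exits with status 0. The xor-zeroing has to come before the cmp because xor itself writes FLAGS.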
The operand-size prefix isn't length-changing because there's no imm16 / imm32. e.g. cmp word [src2], 0x7F is fine (it can use a sign-extended imm8), but cmp word [src2], 0x80 needs an imm16 and will LCP-stall on some Intel CPUs. (Without the operand-size prefix, the same opcode would have an imm32, i.e. the rest of the instruction would be a different length). Instead, use mov eax, 0x80 / cmp word [src2], ax.
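The three cases can be laid side by side with their encodings. This is an assemble-only sketch (src2 is a hypothetical label, and the listing-file command is just one assumed way to inspect the bytes); it relies on the assembler picking the short imm8 form when the constant fits, which NASM does by default in reasonably recent versions.

```nasm
; lcp.asm - assemble-only sketch; src2 is a hypothetical label.
; Assumed way to view the machine code: nasm -felf64 -l lcp.lst lcp.asm
default rel

section .bss
src2:   resw 1

section .text
    cmp     word [src2], 0x7F   ; 66 83 /7 ib: sign-extended imm8, prefix doesn't change the length
    cmp     word [src2], 0x80   ; 66 81 /7 iw: imm16 instead of imm32, so 66 is length-changing

    ; workaround: materialize the constant in a register first
    mov     eax, 0x80           ; B8 id: plain 32-bit immediate, no operand-size prefix
    cmp     word [src2], ax     ; 66 39 /r: prefix, but no immediate at all
```

0x80 can't use the imm8 form because a sign-extended 0x80 would be 0xFF80, not 0x0080.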
The address-size prefix can be length-changing in 32-bit mode (disp32 vs. disp16), but we don't want to use 16-bit addressing modes to access 16-bit data. We're still using [ebx+1234] (or rbx), not [bx+1234].
On modern x86: Intel P6 / SnB-family / Atom / Silvermont, AMD since at least K7, i.e. anything made in this century, newer than actual P5 Pentium, movzx loads are very efficient.
On many CPUs, the load ports directly support movzx (and sometimes also movsx), so it runs as just a load uop, not as a load + ALU.
Data from Agner Fog's instruction-set tables: Note they may not cover every corner case, e.g. mov-load numbers might only be for 32 / 64-bit loads. Also note that Agner Fog's load latency numbers are not load-use latency from L1D cache; they only make sense as part of the store/reload (store-forwarding) latency, but relative numbers will tell us how many cycles movzx adds on top of mov (often no extra cycles).
(Update: uops.info has better test results that actually reflect load-use latency, and they're automated so typos and clerical errors in updating the spreadsheets aren't a problem. But it only goes back to Conroe (first-gen Core 2) for Intel, and only Zen for AMD.)
P5 Pentium (in-order execution): movzx-load is a 3-cycle instruction (plus a decode bottleneck from the 0F prefix), vs. mov-loads being single cycle throughput. (They still have latency, though).
PPro / Pentium II / III: movzx loads run on just a load port, same throughput as plain mov.
Core2 / Nehalem: same, including 64-bit movsxd, except on Core 2 where a movsxd r64, m32 load costs a load + ALU uop, which don't micro-fuse.
Sandybridge-family (SnB through Skylake and later): movzx loads are single-uop (just a load port), and perform identically to mov loads.
Pentium4 (netburst): movzx runs on the load port only, same perf as mov. movsx is load + ALU, and takes 1 extra cycle.
Atom (in-order): Agner's table is unclear on whether memory-source movzx needs an ALU, but it's definitely fast. The latency number is only for reg,reg.
Silvermont: same as Atom: fast but unclear on needing a port.
KNL (based on Silvermont): Agner lists movzx with a memory source as using IP0 (ALU), but latency is the same as mov r,m so there's no penalty. (Execution-unit pressure is not a problem because KNL's decoders can barely keep its 2 ALUs fed anyway.)
Bobcat: movzx loads are 1 per clock, 5 cycle latency. mov-load is 4c latency.
Jaguar: movzx loads are 1 per clock, 4 cycle latency. mov loads are 1 per clock, 3c latency for 32/64-bit, or 4c for mov r8/r16, m (but still only an AGU port, not an ALU merge like Haswell/Skylake do).
K7/K8/K10: movzx loads have 2-per-clock throughput, latency 1 cycle higher than a mov load. They use an AGU and an ALU.
Bulldozer-family: same as K10, but movsx-load has 5 cycle latency. movzx-load has 4 cycle latency, mov-load has 3 cycle latency. So in theory it might be lower latency to mov cx, word [mem] and then movsx eax, cx (1 cycle), if the false dependency from a 16-bit mov load doesn't require an extra ALU merge, or create a loop-carried dependency for your loop.
Ryzen: movzx loads run in the load port only, same latency as mov loads.
Via Nano 2000/3000: movzx runs on the load port only, same latency as mov loads. movsx is LD + ALU, with 1c extra latency.
When I say "perform identically", I mean not counting any partial-register penalties or cache-line splits from a wider load. e.g. a movzx eax, word [rsi] avoids a merging penalty vs mov ax, word [rsi] on Skylake, but I'll still say that mov performs identically to movzx. (I guess I mean that mov eax, dword [rsi] without any cache-line splits is as fast as movzx eax, word [rsi].)
xor-zeroing the full register before writing a 16-bit register avoids a later partial-register merging stall on Intel P6-family, as well as breaking false dependencies.
If you want to run well on P5 as well, this might be somewhat better there while not being much worse on any modern CPUs except PPro to PIII where xor-zeroing isn't dep-breaking, even though it is still recognized as a zeroing-idiom making EAX equivalent to AX (no partial-register stall when reading EAX after writing AL or AX).
;; Probably not a good idea, maybe not faster on anything.
;mov eax, 0 ; some code tuned for PIII used *both* this and xor-zeroing.
xor eax, eax ; *not* dep-breaking on early P6 (up to PIII)
mov ax, word [src1]
cmp ax, word [src2]
; safe to read EAX without partial-reg stalls
The operand-size prefix isn't ideal for P5, so you could consider using a 32-bit load if you're sure it doesn't fault, cross a cache-line boundary, or cause a store-forwarding failure from a recent 16-bit store.
Actually, I think a 16-bit mov load might be slower on Pentium than the movzx / cmp 2-instruction sequence. There really doesn't seem to be a good option for working with 16-bit data as efficiently as 32-bit! (Other than packed MMX stuff, of course).
See Agner Fog's guide for the Pentium details, but the operand-size prefix takes an extra 2 cycles to decode on P1 (original P5) and PMMX, so this sequence may actually be worse than a movzx load. On P1 (but not PMMX), the 0F escape byte (used by movzx) also counts as a prefix, taking an extra cycle to decode.
Apparently movzx isn't pairable anyway. Multi-cycle movzx will hide the decode latency of cmp ax, [src2], so movzx / cmp is probably still the best choice. Or schedule instructions so the movzx is done earlier and the cmp can maybe pair with something. Anyway, the scheduling rules are quite complicated for P1/PMMX.
I timed this loop on Core2 (Conroe) to prove that xor-zeroing avoids partial register stalls for 16-bit registers as well as low-8 (like for setcc al):
mov ebp, 100000000
.loop:
%rep 4
xor eax, eax
; mov eax, 1234 ; just break dep on the old value, not a zeroing idiom
mov ax, cx ; write AX
mov edx, eax ; read EAX
%endrep
dec ebp ; Core2 can't fuse dec / jcc even in 32-bit mode
jg .loop ; but SnB does
perf stat -r4 ./testloop output for this, in a static binary that makes a sys_exit system call afterwards:
;; Core2 (Conroe) with XOR eax, eax
469,277,071 cycles # 2.396 GHz
1,400,878,601 instructions # 2.98 insns per cycle
100,156,594 branches # 511.462 M/sec
9,624 branch-misses # 0.01% of all branches
0.196930345 seconds time elapsed ( +- 0.23% )
2.98 instructions per cycle makes sense: 3 ALU ports, all instructions are ALU, and there's no macro-fusion, so each is 1 uop. So we're running at 3/4 of the front-end capacity. The loop has 3*4 + 2 instructions / uops.
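For anyone who wants to repeat the measurement, a complete source file for the loop might look like the following. This is a sketch under assumptions, not the exact file that produced the numbers here: the file name, the build commands, and the x86-64 Linux exit sequence are filled in for illustration.

```nasm
; testloop.asm - sketch of a complete harness (x86-64 Linux assumed)
; Assumed build: nasm -felf64 testloop.asm && ld -o testloop testloop.o
; Run:           perf stat -r4 ./testloop
section .text
global _start
_start:
    mov     ebp, 100000000      ; outer iteration count
.loop:
%rep 4
    xor     eax, eax            ; zeroing idiom under test
;   mov     eax, 1234           ; swap in to test plain dependency-breaking instead
    mov     ax, cx              ; write AX
    mov     edx, eax            ; read EAX
%endrep
    dec     ebp
    jg      .loop

    xor     edi, edi            ; exit status 0
    mov     eax, 60             ; __NR_exit on x86-64 Linux
    syscall
```

Each outer iteration runs 3*4 + 2 = 14 instructions, so 100M iterations gives about 1.4G instructions in total.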
Things are very different on Core2 with the xor-zeroing commented out and using the mov eax, imm32 instead:
;; Core2 (Conroe) with MOV eax, 1234
1,553,478,677 cycles # 2.392 GHz
1,401,444,906 instructions # 0.90 insns per cycle
100,263,580 branches # 154.364 M/sec
15,769 branch-misses # 0.02% of all branches
0.653634874 seconds time elapsed ( +- 0.19% )
0.9 IPC (down from 3) is consistent with the front-end stalling for 2 to 3 cycles to insert a merging uop on every mov edx, eax.
Skylake runs both loops identically, because mov eax,imm32 is still dependency-breaking. (Like most instructions with a write-only destination, but beware of false dependencies from popcnt and lzcnt.)
Actually, the uops_executed.thread perf counter does show a difference: on SnB-family, xor-zeroing doesn't take an execution unit because it's handled in the issue/rename stage. (mov edx,eax is also eliminated at rename, so the uop count is actually quite low). The cycle count is the same to within less than 1% either way.
;;; Skylake (i7-6700k) with xor-zeroing
Performance counter stats for './testloop' (4 runs):
84.257964 task-clock (msec) # 0.998 CPUs utilized ( +- 0.21% )
0 context-switches # 0.006 K/sec ( +- 57.74% )
0 cpu-migrations # 0.000 K/sec
3 page-faults # 0.036 K/sec
328,337,097 cycles # 3.897 GHz ( +- 0.21% )
100,034,686 branches # 1187.243 M/sec ( +- 0.00% )
1,400,195,109 instructions # 4.26 insn per cycle ( +- 0.00% ) ## dec/jg fuses into 1 uop
1,300,325,848 uops_issued_any # 15432.676 M/sec ( +- 0.00% ) ### fused-domain
500,323,306 uops_executed_thread # 5937.994 M/sec ( +- 0.00% ) ### unfused-domain
0 lsd_uops # 0.000 K/sec
0.084390201 seconds time elapsed ( +- 0.22% )
lsd.uops is zero because the loop buffer is disabled by a microcode update. This bottlenecks on the front-end: uops (fused-domain) / clock = 3.960 (out of 4). That last .04 might be partly OS overhead (interrupts and so on), because this is only counting user-space uops.
movsx reg,r/m16 cost 1/1 cycle. LCP stalls are heavily architecture dependent. The Intel advice is to load 32 bits and only use the 16-bit register. – Hans Passant
movzx is fast on anything after PMMX, so you could just use that. – harold
movzx as appropriate. They are fast on anything made in the last decade. – Stephen Canon