This is a perfectly safe and useful optimization, similar to using an 8-bit immediate instead of a 32-bit immediate when you write add eax, 1.
NASM only optimizes when the shorter form of the instruction has an identical architectural effect, because mov eax,1 implicitly zeros the upper 32 bits of RAX. Note that add rax, 0 is different from add eax, 0, so NASM can't optimize that: only instructions like mov r32,... / mov r64,... or xor eax,eax that don't depend on the old value of the 32- vs. 64-bit register can be optimized this way.
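A quick sketch of the distinction (byte values are the standard encodings; exact NASM output can of course depend on version and -O level):

mov eax, 1      ; B8 01 00 00 00: writing EAX zeros the upper 32 bits of RAX
mov rax, 1      ; NASM emits the same 5 bytes, because the architectural effect is identical

add eax, 0      ; zeros the upper 32 bits of RAX
add rax, 0      ; leaves the upper 32 bits alone, so NASM must not shrink this one
xor eax, eax    ; doesn't read the old value, so it's safe (and the usual way to zero RAX)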
You can disable it with nasm -O1 (the default is -O2), but note that you'll get the 10-byte mov rax, strict qword 1 in that case: clearly NASM isn't intended to really be used with less than normal optimization. There isn't a setting where it will use the shortest encoding that wouldn't change the disassembly (e.g. the 7-byte mov rax, sign_extended_imm32 = mov rax, strict dword 1).
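For reference, roughly what the choices look like for this one instruction (encodings from the instruction-set reference; the -O1 behaviour is as described above):

mov rax, 1              ; -O2 (default): B8 01 00 00 00                 assembled as mov eax, 1
                        ; -O1:           48 B8 01 00 00 00 00 00 00 00  mov r64, imm64
mov rax, strict dword 1 ; 48 C7 C0 01 00 00 00                          mov r/m64, sign_extended_imm32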
The difference between -O0 and -O1 is in imm8 vs. imm32, e.g. add rax, 1 is 48 83 C0 01 (add r/m64, sign_extended_imm8) with -O1, vs. 48 05 01 00 00 00 (add rax, sign_extended_imm32) with nasm -O0. Amusingly it still optimizes by picking the special-case opcode that implies an RAX destination instead of taking a ModRM byte. Unfortunately -O1 doesn't optimize immediate sizes for mov (where sign_extended_imm8 isn't possible).
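If you want to see which encoding you actually got, a listing file (or a disassembly of the object file) shows the bytes next to each line; the file names here are just placeholders:

; nasm -O1 -felf64 foo.asm -l foo.lst    ; -l writes a listing with the machine code for each line
; objdump -d -Mintel foo.o               ; or disassemble the resulting object file afterwards
add rax, 1                               ; listed as 48 83 C0 01 with -O1, 48 05 01 00 00 00 with -O0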
If you ever need a specific encoding somewhere, ask for it with strict instead of disabling optimization.
Note that YASM doesn't do this operand-size optimization, so it's a good idea to make the optimization yourself in the asm source, if you care about code-size (even indirectly for performance reasons) in code that could be assembled with other NASM-compatible assemblers.
For instructions where 32- and 64-bit operand size wouldn't be equivalent (e.g. if you had very large or negative numbers), you need to use 32-bit operand size explicitly even when assembling with NASM instead of YASM, if you want the size / performance advantage.
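Concretely, when you know the upper 32 bits don't matter to later code, write the 32-bit form yourself (byte counts from the standard encodings):

add eax, 1      ; 83 C0 01      3 bytes, and it also zeros the upper 32 bits of RAX
add rax, 1      ; 48 83 C0 01   4 bytes, preserves the upper 32 bits, so no assembler will shrink it for you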
The advantages of using 32bit registers/instructions in x86-64
For 32-bit constants that don't have their high bit set, zero- or sign-extending them to 64 bits gives an identical result. Thus it's a pure optimization to assemble mov rax, 1 to a 5-byte mov r32, imm32 (with implicit zero extension to 64 bits) instead of a 7-byte mov r/m64, sign_extended_imm32.
(See Difference between movq and movabsq in x86-64 for more details about the forms of mov x86-64 allows; AT&T syntax has a special name for the 10-byte immediate form but NASM doesn't.)
On all current x86 CPUs, the only performance difference between that and the 7-byte encoding is code-size, so only indirect effects like alignment and L1I$ pressure are a factor. Internally it's just a mov-immediate, so this optimization doesn't change the microarchitectural effect of your code either (except of course for code-size / alignment / how it packs in the uop cache).
The 10-byte mov r64, imm64 encoding is even worse for code size. If the constant actually has any of its high bits set, then it has extra inefficiency in the uop cache on Intel Sandybridge-family CPUs (using 2 entries in the uop cache, and maybe an extra cycle to read from the uop cache). But if the constant is in the -2^31 .. +2^31-1 range (signed 32-bit), it's stored internally just as efficiently, using only a single uop-cache entry, even if it was encoded in the x86 machine code using a 64-bit immediate. (See Agner Fog's microarch doc, Table 9.1 "Size of different instructions in μop cache" in the Sandybridge section.)
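For example (the uop-cache numbers are the Sandybridge-family behaviour cited above):

mov rax, strict qword 1       ; 10 bytes, but the value fits in a signed 32-bit imm: one uop-cache entry
mov rax, 0x123456789ABCDEF0   ; 10 bytes and genuinely needs imm64: 2 uop-cache entries on SnB-family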
From How many ways to set a register to zero?, you can force any of the three encodings:
mov eax, 1 ; 5 bytes to encode (B8 imm32)
mov rax, strict dword 1 ; 7 bytes: REX mov r/m64, sign-extended-imm32. NASM optimizes mov rax,1 to the 5B version, but dword or strict dword stops it for some reason
mov rax, strict qword 1 ; 10 bytes to encode (REX B8 imm64). movabs mnemonic for AT&T. Normally assemblers choose smaller encodings if the operand fits, but strict qword forces the imm64.
Note that NASM used the 10-byte encoding (which AT&T syntax calls movabs, and so does objdump in Intel-syntax mode) for an address which is a link-time constant but unknown at assemble time. YASM chooses mov r64, imm32, i.e. it assumes a code model where label addresses are 32 bits, unless you use mov rsi, strict qword msg.
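i.e. for a label operand like the msg used above, the two assemblers pick different defaults:

mov rsi, msg                ; NASM: 10-byte mov r64, imm64 with a 64-bit absolute relocation for msg
                            ; YASM:  7-byte mov r64, sign_extended_imm32, assuming msg fits in 32 bits
mov rsi, strict qword msg   ; forces the imm64 form either way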
YASM's behaviour is normally good (although using mov r32, imm32 for static absolute addresses like C compilers do would be even better). The default non-PIC code model puts all static code/data in the low 2GiB of virtual address space, so zero- or sign-extended 32-bit constants can hold addresses.
If you want 64-bit label addresses you should normally use lea r64, [rel address] to do a RIP-relative LEA. (On Linux at least, position-dependent code can go in the low 2GiB of virtual address space, so unless you're using the large / huge code models, any time you need to care about 64-bit label addresses you're also making PIC code, where you should use RIP-relative LEA to avoid needing text relocations of absolute address constants.)
i.e. gcc and other compilers would have used mov esi, msg, or lea rsi, [rel msg], never mov rsi, msg.
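A sketch of those options side by side (sizes from the standard encodings; msg is a label in a data section):

mov esi, msg            ; 5 bytes: absolute address, fine when msg is in the low 32 bits of address space (non-PIE)
lea rsi, [rel msg]      ; 7 bytes: RIP-relative, works in PIC/PIE code at any load address
mov rsi, msg            ; 10 bytes in NASM: imm64 plus a 64-bit absolute relocation; avoid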
mov rax,1 has exactly the same effect as mov eax,1 (because on x86-64 writing to a 32-bit register variant like eax will automatically clear the upper 32 bits of the 64-bit rax, that's how AMD designed the x86-64). And the eax variant is a 1B shorter opcode for the tiny immediate (the rax one has exactly the same opcode with a REX prefix byte ahead). - But I didn't think it is doing it even in this case, surprised me a bit (I was aware only of mov eax,1 picking the imm8 opcode variant automatically, unless you write mov eax, dword 1 to force it to use the imm32 one). – Ped7g