To efficiently do x = x*10 + 1, it's probably optimal to use:
lea eax, [rax + rax*4] ; x*=5
lea eax, [1 + rax*2] ; x = x*2 + 1
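As a quick sanity check of the arithmetic, here is a minimal Python sketch that models the two LEAs on a 32-bit value. The function name times10_plus1 is made up for illustration, and the masking assumes the usual wraparound of a 32-bit destination register.

```python
MASK32 = 0xFFFFFFFF  # results written to eax are truncated to 32 bits

def times10_plus1(x):
    """Model the two-LEA sequence on a 32-bit value (hypothetical helper)."""
    x = (x + x * 4) & MASK32   # lea eax, [rax + rax*4]  ; x *= 5
    x = (1 + x * 2) & MASK32   # lea eax, [1 + rax*2]    ; x = x*2 + 1
    return x

# Compare against the direct computation, including a value that wraps.
for x in (0, 1, 7, 12345, 0xFFFFFFFF):
    assert times10_plus1(x) == (x * 10 + 1) & MASK32

print(times10_plus1(7))  # 71
```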
A 3-component LEA has higher latency on modern Intel CPUs (e.g. 3 cycles vs. 1 on Sandybridge-family), so disp32 + index*2 is faster than disp8 + base + index*1 on SnB-family, i.e. most of the mainstream x86 CPUs we care about optimizing for. (This mostly only applies to LEA, not loads/stores, because LEA runs on ALU execution units, not the AGUs, in most modern x86 CPUs.) AMD CPUs have slower LEA with 3 components or scale > 1 (see http://agner.org/optimize/).
But NASM and YASM will optimize for code-size by using [1 + rax + rax*1] for the 2nd LEA, which only needs a disp8 instead of a disp32. (Addressing modes always have a base register or a disp32, so an index-only mode can't use a disp8.) I.e. they always split reg*2 into base+index, because that's never worse for code-size.
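To make the code-size trade-off concrete, here is a minimal Python sketch that builds both encodings by hand from the ModRM and SIB bit fields. The helper names modrm and sib are my own, not anything from NASM, and the 4-byte disp8 form is what I'd expect the assembler's default output to be rather than something quoted from a listing.

```python
import struct

def modrm(mod, reg, rm):
    """Pack the ModRM byte: mod (2 bits), reg (3 bits), rm (3 bits)."""
    return (mod << 6) | (reg << 3) | rm

def sib(scale, index, base):
    """Pack the SIB byte; the scale factor is stored as log2 in the top 2 bits."""
    return ({1: 0, 2: 1, 4: 2, 8: 3}[scale] << 6) | (index << 3) | base

RAX = 0  # register number shared by eax and rax

# lea eax, [rax + rax*1 + disp8]: mod=01 selects disp8, rm=100 means a SIB follows
split_form = bytes([0x8D, modrm(0b01, RAX, 0b100), sib(1, RAX, RAX), 1])

# lea eax, [disp32 + rax*2]: mod=00 with SIB base=101 means "no base, disp32"
scaled_form = bytes([0x8D, modrm(0b00, RAX, 0b100), sib(2, RAX, 0b101)])
scaled_form += struct.pack("<i", 1)

print(split_form.hex(" "), len(split_form))    # 8d 44 00 01 4
print(scaled_form.hex(" "), len(scaled_form))  # 8d 04 45 01 00 00 00 7
```

So the split form saves 3 bytes, at the cost of the extra address component.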
I can force using a disp32 with lea eax, [dword 1 + rax*2], but that doesn't stop NASM or YASM from splitting the addressing mode. The NASM manual doesn't seem to document a way to use the strict keyword on the scale factor, and [1 + strict rax*2] doesn't assemble. Is there a way to use strict or some other syntax to force the desired encoding of an addressing mode?
Using nasm -O0 to disable optimizations doesn't work. Apparently that only controls multi-pass branch-displacement optimization, not all optimizations NASM makes. Of course, you wouldn't want to do that for a whole source file in the first place, even if it did work. I still get:
8d 84 00 01 00 00 00 lea eax,[rax+rax*1+0x1]
The only workaround I can think of is to encode it manually with db, which is quite inconvenient. For the record, the manual encoding is:
db 0x8d, 0x04, 0x45 ; opcode, modrm, SIB for lea eax, [disp32 + rax*2]
dd 1 ; disp32
The scale factor is encoded in the high 2 bits of the SIB byte. I assembled lea eax, [dword 1 + rax*4] to get the machine code for the right registers, because NASM's optimization only applies to *2. The SIB was 0x85, and decrementing that 2-bit field at the top of the byte reduced the scale factor from 4 to 2 (SIB = 0x45).
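The SIB tweak can be checked with a few lines of Python. This is only a sketch of the bit manipulation; decode_sib is a hypothetical helper name, not part of any assembler or library.

```python
def decode_sib(b):
    """Split a SIB byte into (scale_factor, index_reg, base_reg)."""
    scale_factor = 1 << (b >> 6)   # top 2 bits hold log2 of the scale
    index_reg = (b >> 3) & 0b111   # middle 3 bits: index register number
    base_reg = b & 0b111           # low 3 bits: base (101 = none, with mod=00)
    return scale_factor, index_reg, base_reg

sib_x4 = 0x85                # SIB from lea eax, [dword 1 + rax*4]
sib_x2 = sib_x4 - (1 << 6)   # decrement the 2-bit scale field by one

print(hex(sib_x2))           # 0x45
print(decode_sib(sib_x4))    # (4, 0, 5)
print(decode_sib(sib_x2))    # (2, 0, 5)
```

Only the scale field changes; the index (rax = 0) and the no-base marker (101) stay the same.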
But the question is: how can I write it in a nicely readable way that makes it easy to change registers, and get NASM to encode the addressing mode for me? (I suppose a giant macro could do this with text processing and manual db encoding, but that's not really the answer I'm looking for. I don't actually need this for anything right now; I mostly want to know if NASM or YASM has syntax to force this.)
Other optimizations I'm aware of, like mov rax, 1 assembling to the 5-byte mov eax,1, are pure wins on all CPUs unless you want longer instructions to get padding without NOPs. They can be disabled with mov rax, strict dword 1 to get the 7-byte sign-extended encoding, or strict qword for the 10-byte imm64.
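For reference, the three mov encodings and their sizes can be reproduced with a short Python sketch. The byte values are written out by hand from the standard x86-64 opcodes; the variable names are just for illustration.

```python
import struct

# mov eax, imm32: opcode B8+rd with a 32-bit immediate.
# Writing eax zero-extends into rax, so this also sets rax = 1.
mov_eax = bytes([0xB8]) + struct.pack("<I", 1)

# mov rax, strict dword 1: REX.W + C7 /0, ModRM C0 selects rax,
# followed by a sign-extended 32-bit immediate.
mov_rax_imm32 = bytes([0x48, 0xC7, 0xC0]) + struct.pack("<i", 1)

# mov rax, strict qword 1: REX.W + B8+rd with a full 64-bit immediate.
mov_rax_imm64 = bytes([0x48, 0xB8]) + struct.pack("<Q", 1)

for name, code in [("mov eax, 1", mov_eax),
                   ("mov rax, strict dword 1", mov_rax_imm32),
                   ("mov rax, strict qword 1", mov_rax_imm64)]:
    print(f"{name:26} {len(code):2} bytes  {code.hex(' ')}")
```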
gas doesn't do this or most other optimizations (only sizes of immediates and branch displacements): lea 1(,%rax,2), %eax assembles to 8d 04 45 01 00 00 00 lea eax,[rax*2+0x1], and the same goes for the .intel_syntax noprefix version.
Answers for MASM or other assemblers would also be interesting, though.