9
votes

To efficiently do x = x*10 + 1, it's probably optimal to use

lea   eax, [rax + rax*4]   ; x*=5
lea   eax, [1 + rax*2]     ; x = x*2 + 1

A 3-component LEA (base + index + displacement) has higher latency on modern Intel CPUs, e.g. 3 cycles vs. 1 on Sandybridge-family, so disp32 + index*2 is faster than disp8 + base + index*1 on SnB-family, i.e. most of the mainstream x86 CPUs we care about optimizing for. (This mostly only applies to LEA, not loads/stores, because LEA runs on ALU execution units, not the AGUs, in most modern x86 CPUs.) AMD CPUs have slower LEA with 3 components or scale > 1; see Agner Fog's instruction tables at http://agner.org/optimize/.

But NASM and YASM will optimize for code-size by using [1 + rax + rax*1] for the 2nd LEA, which only needs a disp8 instead of a disp32. (An addressing mode always needs either a base register or a disp32: there is no encoding for an index with no base and only a disp8.)

i.e. they always split reg*2 into base+index, because that's never worse for code-size.

I can force using a disp32 with lea eax, [dword 1 + rax*2], but that doesn't stop NASM or YASM from splitting the addressing mode. The NASM manual doesn't seem to document a way to use the strict keyword on the scale factor, and [1 + strict rax*2] doesn't assemble. Is there a way to use strict or some other syntax to force the desired encoding of an addressing mode?


nasm -O0 to disable optimizations doesn't work. Apparently that only controls multi-pass branch-displacement optimization, not all optimizations NASM makes. Of course you don't want to do that in the first place for a whole source file, even if it did work. For lea eax, [dword 1 + rax*2], I still get:

8d 84 00 01 00 00 00    lea    eax,[rax+rax*1+0x1]

The only workaround I can think of is to encode it manually with db. This is quite inconvenient. For the record, the manual encoding is:

db 0x8d, 0x04, 0x45  ; opcode, modrm, SIB  for lea eax, [disp32 + rax*2]
dd 1                 ; disp32

The scale factor is encoded in the high 2 bits of the SIB byte. I assembled lea eax, [dword 1 + rax*4] to get the machine code for the right registers, because NASM's optimization only works for *2. The SIB was 0x85, and decrementing that 2-bit field at the top of the byte reduced the scale factor from 4 to 2.
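
The same bit arithmetic can be written with assembler expressions instead of literal hex. This is only a sketch: the symbol names are made up for illustration, and it assumes the standard x86 SIB layout (scale in bits 7-6, index in bits 5-3, base in bits 2-0).

```nasm
; Hypothetical names, not anything NASM defines.
; 2-bit scale field: 0 = *1, 1 = *2, 2 = *4, 3 = *8
%assign SCALE_X2   1
%assign SCALE_X4   2
; register number of rax/eax in the index field
%assign IDX_RAX    0
; base = 101b together with ModRM.mod = 00 means "no base, disp32 follows"
%assign BASE_NONE  5

; SIB that NASM emitted for [dword 1 + rax*4]: 0x85
%assign SIB_RAX_X4 (SCALE_X4 << 6) | (IDX_RAX << 3) | BASE_NONE
; same byte with the scale field decremented: 0x45
%assign SIB_RAX_X2 (SCALE_X2 << 6) | (IDX_RAX << 3) | BASE_NONE

    db 0x8d, 0x04, SIB_RAX_X2   ; opcode, modrm, SIB for lea eax, [disp32 + rax*2]
    dd 1                        ; disp32
```

This still hard-codes the opcode and ModRM byte, so it is no easier to change registers than the plain db version.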


But the question is: how to write it in a nicely readable way that makes it easy to change registers, and get NASM to encode the addressing mode for you? (I suppose a giant macro could do this with text processing and manual db encoding, but that's not really the answer I'm looking for. I don't actually need this for anything right now, I mostly want to know if NASM or YASM has syntax to force this.)

Other optimizations I'm aware of, like mov rax, 1 assembling to the 5-byte mov eax, 1, are pure wins on all CPUs unless you want longer instructions to get padding without NOPs. They can be disabled with mov rax, strict dword 1 to get the 7-byte sign-extended encoding, or strict qword for the 10-byte imm64.
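
For reference, the three forms side by side. The sizes in the comments are the ones stated above, and assume 64-bit mode with NASM's default optimization enabled:

```nasm
bits 64

    mov rax, 1                  ; optimized to the 5-byte mov eax, 1
    mov rax, strict dword 1     ; 7 bytes: imm32 sign-extended to 64 bits
    mov rax, strict qword 1     ; 10 bytes: full imm64
```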


gas doesn't do this or most other optimizations (only sizes of immediates and branch displacements): lea 1(,%rax,2), %eax assembles to
8d 04 45 01 00 00 00 lea eax,[rax*2+0x1], and same for the .intel_syntax noprefix version.

Answers for MASM or other assemblers would also be interesting, though.

1 Answer

8
votes

Use NOSPLIT. From the NASM manual's section on effective addresses:

Similarly, NASM will split [eax*2] into [eax+eax] because that allows the offset field to be absent and space to be saved; in fact, it will also split [eax*2+offset] into [eax+eax+offset].
You can combat this behaviour by the use of the NOSPLIT keyword: [nosplit eax*2] will force [eax*2+0] to be generated literally.
[nosplit eax*1] also has the same effect. In another way, a split EA form [0, eax*2] can be used, too. However, NOSPLIT in [nosplit eax+eax] will be ignored because user's intention here is considered as [eax+eax].

lea eax, [NOSPLIT 1+rax*2]
lea eax, [1+rax*2]

00000000  8D044501000000    lea eax,[rax*2+0x1]
00000007  8D440001          lea eax,[rax+rax+0x1]
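
Applied to the x = x*10 + 1 sequence from the question, this would look like the following sketch. It assumes 64-bit mode, and the registers are just the ones the question used:

```nasm
bits 64

    lea eax, [rax + rax*4]          ; x *= 5
    lea eax, [nosplit 1 + rax*2]    ; x = x*2 + 1, kept as disp32 + index*2
```

Changing registers only means editing the operands, since NASM still generates the ModRM and SIB bytes itself.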