
Starting with 32-bit CPU mode, the x86 architecture offers extended address operands: one can specify a base address, a displacement, an index register and a scaling factor.

For example, suppose we want to stride through a list of 32-bit integers, loading the first two from each 32-byte data structure in an array (%rdi holds the data index, %rbx the base pointer).

addq   $8, %rdi                # skip one structure: advance index by 8 values
movl   (%rbx, %rdi, 4), %eax   # load data: pointer + scaled index
movl   4(%rbx, %rdi, 4), %edx  # load data: pointer + scaled index + displacement

As far as I know, such complex addressing fits into a single machine-code instruction. But what is the cost of this operation, and how does it compare to simple addressing with independent pointer calculation:

addq  $32, %rbx      # skip one structure: move pointer forward by 32 bytes
movl  (%rbx), %eax   # load data: pointer
addq  $4, %rbx       # point to next value: move pointer forward by 4 bytes
movl  (%rbx), %edx   # load data: pointer

In the latter example, I have introduced one extra instruction and a dependency. But integer addition is very fast, the address operands are simpler, and there is no multiplication any more. On the other hand, since the allowed scaling factors are powers of 2, the multiplication reduces to a bit shift, which is also a very fast operation. Still, two additions and a bit shift could be replaced with one addition.

What are the performance and code size differences between these two approaches? Are there any best practices for using the extended addressing operands?

Or, asking it from a C programmer's point of view, what is faster: array indexing or pointer arithmetic?


Is there any assembly editor meant for size/performance tuning? I wish I could see the machine-code size of each instruction, its execution time in clock cycles, or a dependency graph. There are thousands of assembly enthusiasts who would benefit from such an application, so I bet something like this already exists!

General answer #0: Optimization is voodoo, and things like adding instructions or using longer instructions can speed things up in some cases. These behaviors can vary from CPU to CPU; something true on one model may not be true on a newer model. In your case, things can go either way, and there is no good way to predict without simply measuring. – Nayuki
@NayukiMinase, some very useful links. Very worth a browse. Thanks. – TerryE
@Nayuki Minase: optimization is not voodoo, though it may appear to be to someone uninformed examining generated assembly code. Nor is it true that "there is no good way to predict without simply measuring": quality compilers do a very good job of predicting without measuring, simply or otherwise. – Olof Forshell
I know of no such editor as you are asking about. A very complete sampling tool is available from AMD: CodeAnalyst for 32- and 64-bit x86. – Olof Forshell

2 Answers


Address arithmetic is very fast and should be used whenever possible.

But here is something that the question misses.

First, you can't multiply by 32 using address arithmetic: 8 is the largest possible scale factor.

The first version of the code is not complete, because it needs a second instruction to increment the index register. So we have the following two variants:

inc  rbx          
mov  eax, [8*rbx+rdi]

vs

add  rbx, 8
mov  eax, [rbx]

This way, the speed of the two variants will be the same, and so is the size: 6 bytes in both cases.

So, which code is better depends only on the program context: if we already have a register containing the address of the needed array cell, use mov eax, [rbx].

If we have one register containing the index of the cell and another containing the start address, then use the first variant. That way, after the algorithm ends, we will still have the start address of the array in rdi.


The answer to your question depends on the local program-flow circumstances, and those, in turn, may vary somewhat between processor manufacturers and architectures. Microanalyzing a single instruction or two is usually pointless. You have a multi-stage pipeline, more than one integer unit, caches and much more that come into play and that you need to factor into your analysis.

You could try reverse-engineering by looking at generated assembly code and analyzing why the sequence looks the way it does with respect to the different hardware units that will be working on it.

Another way is to use a profiler and experiment with different constructs to see what works well and what doesn't.

You could also download the source code for gcc and see how the really cool programmers evaluate a sequence to produce the fastest possible code. One day you might become one of them :-)

In any case I expect that you will come to the conclusion that the best sequence varies greatly depending on processor, compiler, optimization level and surrounding instructions. If you are using C the quality of the source code is extremely important: garbage in = garbage out.