
Starting with 32-bit CPU mode, the x86 architecture offers extended address operands: one can specify a base address, a displacement, an index register and a scaling factor.

For example, suppose we want to stride through a list of 32-bit integers, loading the first two from each 32-byte data structure in an array (%rdi holds the data index, %rbx the base pointer).

addq   $8, %rdi                # skip one structure: advance index by 8 values
movl   (%rbx, %rdi, 4), %eax   # load data: pointer + scaled index
movl   4(%rbx, %rdi, 4), %edx  # load data: pointer + scaled index + displacement

As far as I know, such complex addressing fits into a single machine-code instruction. But what is the cost of this operation, and how does it compare to simple addressing with independent pointer calculation:

addq  $32, %rbx      # skip one structure: move pointer forward by 32 bytes
movl  (%rbx), %eax   # load data: pointer
addq  $4, %rbx       # point to next value: move pointer forward by 4 bytes
movl  (%rbx), %edx   # load data: pointer

In the latter example, I have introduced one extra instruction and a dependency. But integer addition is very fast, the address operands are simpler, and there is no multiplication any more. On the other hand, since the allowed scaling factors are powers of 2, the multiplication reduces to a bit shift, which is also a very fast operation. Still, two additions and a bit shift could be replaced with one addition.

What are the performance and code size differences between these two approaches? Are there any best practices for using the extended addressing operands?

Or, asking it from a C programmer's point of view, what is faster: array indexing or pointer arithmetic?


Is there any assembly editor meant for size/performance tuning? I wish I could see the machine-code size of each instruction, its execution time in clock cycles, or a dependency graph. There are thousands of assembly enthusiasts who would benefit from such an application, so I bet something like this already exists!

General answer #0: Optimization is voodoo, and things like adding instructions or using longer instructions can speed things up in some cases. These behaviors can vary from CPU to CPU; something true on one model may not be true on a newer model. In your case, things can go either way, and there is no good way to predict without simply measuring. – Nayuki
@NayukiMinase, some very useful links. Very worth a browse. Thanks. – TerryE
@Nayuki Minase: optimization is not voodoo, though it may appear to be to someone uninformed examining generated assembly code. Nor is it true that "there is no good way to predict without simply measuring": quality compilers do a very good job of predicting without measuring, simply or otherwise. – Olof Forshell
I know of no such editor as you are asking about. A very complete sampling tool is available from AMD: CodeAnalyst for 32- and 64-bit x86. – Olof Forshell

2 Answers


Address arithmetic is very fast and should be used whenever possible.

But here is something that the question misses.

First, you can't multiply by 32 using address arithmetic: 8 is the largest possible scale factor.

The first version of the code is not complete, because it needs a second instruction to increment the index register. So we have the following two variants:

inc  rbx          
mov  eax, [8*rbx+rdi]

vs

add  rbx, 8
mov  eax, [rbx]

This way, the speed of the two variants will be the same, and so is the size: 6 bytes in both cases.

So, which code is better depends only on the program context: if we already have a register containing the address of the needed array cell, use mov eax, [rbx].

If we have one register containing the index of the cell and another containing the start address, then use the first variant. That way, after the algorithm ends, we will still have the start address of the array in rdi.


The answer to your question depends on the local program-flow circumstances, and those, in turn, may vary somewhat between processor manufacturers and architectures. Microanalyzing a single instruction or two is usually pointless. You have a multi-stage pipeline, more than one integer unit, caches and much more that come into play and that you need to factor into your analysis.

You could try reverse-engineering by looking at generated assembly code and analyzing why the sequence looks the way it does with respect to the different hardware units that will be working on it.

Another way is to use a profiler and experiment with different constructs to see what works well and what doesn't.

You could also download the source code for gcc and see how the really cool programmers evaluate a sequence to produce the fastest possible code. One day you might become one of them :-)

In any case I expect that you will come to the conclusion that the best sequence varies greatly depending on processor, compiler, optimization level and surrounding instructions. If you are using C the quality of the source code is extremely important: garbage in = garbage out.