Starting with 32-bit mode, the x86 architecture offers extended address operands. One can specify a base register, a displacement, an index register and a scaling factor.
For example, suppose we want to stride through a list of 32-bit integers (the first two from each element of an array of 32-byte-long data structures, with %rdi as the data index and %rbx as the base pointer).
addq $8, %rdi # skip eight values: advance index by 8
movl (%rbx, %rdi, 4), %eax # load data: pointer + scaled index
movl 4(%rbx, %rdi, 4), %edx # load data: pointer + scaled index + displacement
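To make the address arithmetic explicit, here is a minimal C sketch of what the two loads above compute: base + index * scale + displacement, all in bytes. The helper names `effective_address` and `load_pair` are hypothetical and only serve as illustration; they say nothing about what a compiler would actually emit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper mirroring the operand form disp(base, index, scale):
   address = base + index * scale + disp, with everything counted in bytes. */
static const int32_t *effective_address(const void *base, size_t index,
                                        size_t scale, ptrdiff_t disp)
{
    return (const int32_t *)((const unsigned char *)base + index * scale + disp);
}

/* Fetches the same two values as the movl pair above. */
static void load_pair(const int32_t *base, size_t index,
                      int32_t *first, int32_t *second)
{
    *first  = *effective_address(base, index, 4, 0); /* (%rbx, %rdi, 4)  */
    *second = *effective_address(base, index, 4, 4); /* 4(%rbx, %rdi, 4) */
}
```

With an index of 8 and a scale of 4, the first load reads from byte offset 32 and the second from byte offset 36.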
As far as I know, such complex addressing fits into a single machine-code instruction. But what is the cost of such an operation, and how does it compare to simple addressing with independent pointer calculation:
addq $32, %rbx # skip eight values: move pointer forward by 32 bytes
movl (%rbx), %eax # load data: pointer
addq $4, %rbx # point next value: move pointer forward by 4 bytes
movl (%rbx), %edx # load data: pointer
In the latter example, I have introduced one extra instruction and a dependency. But integer addition is very fast, I gained simpler address operands, and there are no multiplications any more. On the other hand, since the allowed scaling factors are powers of 2, the multiplication comes down to a bit shift, which is also a very fast operation. Still, two additions and a bit shift can be replaced with one addition.
What are the performance and code size differences between these two approaches? Are there any best practices for using the extended addressing operands?
Or, asking it from a C programmer's point of view, what is faster: array indexing or pointer arithmetic?
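To pin down what I mean by the two C styles, here is a small sketch under my own assumptions: `struct record` is a hypothetical 32-byte structure of which only the first two fields are read, and `sum_indexed` / `sum_pointer` are illustrative names. Both functions compute the same result, and an optimizing compiler may well turn them into the same machine code, so the source-level difference does not necessarily survive compilation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-byte record; only the first two fields are used. */
struct record {
    int32_t first;
    int32_t second;
    int32_t padding[6];
};

/* Array indexing: each access is expressed as base + i * sizeof(record). */
static int64_t sum_indexed(const struct record *recs, size_t count)
{
    int64_t sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += (int64_t)recs[i].first + recs[i].second;
    return sum;
}

/* Pointer arithmetic: the pointer itself advances by one record per step. */
static int64_t sum_pointer(const struct record *recs, size_t count)
{
    int64_t sum = 0;
    for (const struct record *p = recs, *end = recs + count; p != end; p++)
        sum += (int64_t)p->first + p->second;
    return sum;
}
```

Comparing the compiler's assembly output for the two functions (for example with `gcc -O2 -S`) shows which addressing form it actually picks on a given target.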
Is there any assembly editor meant for size/performance tuning? I wish I could see the machine-code size of each assembly instruction, its execution time in clock cycles, or a dependency graph. There are thousands of assembly freaks who would benefit from such an application, so I bet that something like this already exists!