About gcc-compiled x86_64 code and C code optimization

Question

I compiled the following C code:

typedef struct {
    long x, y, z;
} Foo;

long Bar(Foo *f, long i)
{
    return f[i].x + f[i].y + f[i].z;
}

with the command gcc -S -O3 test.c. Here is the Bar function in the output:

    .section    __TEXT,__text,regular,pure_instructions
    .globl  _Bar
    .align  4, 0x90
_Bar:
Leh_func_begin1:
    pushq   %rbp
Ltmp0:
    movq    %rsp, %rbp
Ltmp1:
    leaq    (%rsi,%rsi,2), %rcx
    movq    8(%rdi,%rcx,8), %rax
    addq    (%rdi,%rcx,8), %rax
    addq    16(%rdi,%rcx,8), %rax
    popq    %rbp
    ret
Leh_func_end1:

I have a few questions about this assembly code:

What is the purpose of "pushq %rbp", "movq %rsp, %rbp", and "popq %rbp", if neither rbp nor rsp is used in the body of the function?
Why do rsi and rdi automatically contain the arguments to the C function (i and f, respectively) without reading them from the stack?
I tried increasing the size of Foo to 88 bytes (11 longs) and the leaq instruction became an imulq. Would it make sense to design my structs to have "rounder" sizes to avoid the multiply instructions (in order to optimize array access)? The leaq instruction was replaced with:
```
imulq   $88, %rsi, %rcx
```

Daniel Kamil Kozar Daniel Kamil Kozar · Accepted Answer · 2012-06-04T19:16:40

The function is simply building its own stack frame with these instructions. There's nothing really unusual about them. You should note, though, that due to this function's small size, it will probably be inlined when used in the code. The compiler is always required to produce a "normal" version of the function, though. Also, what @ouah said in his answer.
This is because that's how the AMD64 ABI specifies the arguments should be passed to functions.

If the class is INTEGER, the next available register of the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9 is used.

Page 20, AMD64 ABI Draft 0.99.5 – September 3, 2010
This is not directly related to the structure size, rather - the absolute address that the function has to access. If the size of the structure is 24 bytes, f is the address of the array containing the structures, and i is the index at which the array has to be accessed, then the byte offset to each structure is i*24. Multiplying by 24 in this case is achieved by a combination of lea and SIB addressing. The first lea instruction simply calculates i*3, then every subsequent instruction uses that i*3 and multiplies it further by 8, therefore accessing the array at the needed absolute byte offset, and then using immediate displacements to access the individual structure members ((%rdi,%rcx,8). 8(%rdi,%rcx,8), and 16(%rdi,%rcx,8)). If you make the size of the structure 88 bytes, there is simply no way of doing such a thing swiftly with a combination of lea and any kind of addressing. The compiler simply assumes that a simple imull will be more efficient in calculating i*88 than a series of shifts, adds, leas or anything else.

About gcc-compiled x86_64 code and C code optimization

3 Answers