System V ABI - AMD64 - Stack alignment in GCC-emitted assembly

Question

For the C code below, GCC x86-64 10.2 from Compiler Explorer emits assembly that I pasted further below.

One instruction is subq $40, %rsp. The question is, how come subtracting 40 bytes from %rsp does not make the stack misaligned? My understanding is:

Right before call foo, the stack is 16 bytes aligned;
call foo places an 8 bytes return address on the stack, so the stack gets misaligned;
But pushq %rbp at foo's start places another 8 bytes on the stack, so it gets 16 bytes aligned again;
So the stack is 16 bytes aligned right before subq $40, %rsp. As a result, decreasing %rsp by 40 bytes must break the alignment?

Obviously, GCC emits valid assembly in terms of keeping the stack aligned, so I must be missing something.

(I tried replacing GCC with CLANG, and CLANG emits subq $48, %rsp — just as I'd intuitively expect.)

So, what am I missing in the GCC-generated assembly? How does it keep the stack 16 bytes aligned?

int bar(int i) { return i; }
int foo(int p0, int p1, int p2, int p3, int p4, int p5, int p6) {
    int sum = p0 + p1 + p2 + p3 + p4 + p5 + p6;
    return bar(sum);
}
int main() {
    return foo(0, 1, 2, 3, 4, 5, 6);
}

bar:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %edi, -4(%rbp)
        movl    -4(%rbp), %eax
        popq    %rbp
        ret
foo:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $40, %rsp
        movl    %edi, -20(%rbp)
        movl    %esi, -24(%rbp)
        movl    %edx, -28(%rbp)
        movl    %ecx, -32(%rbp)
        movl    %r8d, -36(%rbp)
        movl    %r9d, -40(%rbp)
        movl    -20(%rbp), %edx
        movl    -24(%rbp), %eax
        addl    %eax, %edx
        movl    -28(%rbp), %eax
        addl    %eax, %edx
        movl    -32(%rbp), %eax
        addl    %eax, %edx
        movl    -36(%rbp), %eax
        addl    %eax, %edx
        movl    -40(%rbp), %eax
        addl    %eax, %edx
        movl    16(%rbp), %eax
        addl    %edx, %eax
        movl    %eax, -4(%rbp)
        movl    -4(%rbp), %eax
        movl    %eax, %edi
        call    bar
        leave
        ret
main:
        pushq   %rbp
        movq    %rsp, %rbp
        pushq   $6
        movl    $5, %r9d
        movl    $4, %r8d
        movl    $3, %ecx
        movl    $2, %edx
        movl    $1, %esi
        movl    $0, %edi
        call    foo
        addq    $8, %rsp
        leave
        ret

Interesting find. Apparently the compiler figured bar does not require stack alignment so it didn't bother. If you make it extern int bar(int i); then the stack will be properly aligned. — Jester
Also if you change bar so it does require alignment, for example because it calls another function itself, the compiler notices that too. — Jester
I was curious about this optimization being done at -O0. Apparently it is a feature of ipa stack alignment which is the default in GCC. You can turn it on/off with -fipa-stack-alignment and -fno-ipa-stack-alignment in GCC versions >= 9.0. A comparison of the output with the option on/off in GCC: godbolt.org/z/a1YdjG — Michael Petch
Whether the functions can be called from outside ("above") is not really relevant here. The alignment requirement protects function below the current one and, since gcc can see that all functions below foo have no alignment requirements, it deems it unnecessary. — paxdiablo

paxdiablo paxdiablo · Accepted Answer · 2020-11-01T01:58:20

The purpose of the 16-byte alignment is so that functions being called at any level below the current, don't have to worry about aligning their stack if they require aligned locals.

Without the ABI guarantee, every function that needed this would have to and the stack pointer with some value to ensure it's correctly aligned, something like:

and %rsp, $0xfffffffffffffff0

However, there is no reason why this is necessary in this particular case - the bar() function is a leaf one, meaning the compiler has full knowledge of any alignment requirement at its level or below (it has no locals, and it calls no functions, hence no requirements).

The foo() function also has no requirement below, since the only thing it calls is bar(). It also appears to be deciding that it's own locals do not require that level of alignment either.

Even if bar() or foo() were called from outside the immediate translation unit (and they can be, since they're not marked static), that doesn't change the fact that alignment for them is not required.

Things would be different if, for example, bar was in a separate translation unit or it called other functions where it couldn't be ascertained that the alignment was not required.

That would mean gcc wouldn't have full knowledge of its alignment requirements. And, indeed, if you comment out the bar definition line in godbolt (effectively hiding the definition), you will see the line change:

// int bar(int i) { return i; }
   --> subq $48, %rsp             ; no longer $40

As an aside, although the 16-byte alignment is not technically necessary in this case, I think it may invalidate the claim that gcc uses the System V AMD64 ABI. There appears to be nothing in that ABI that allows for this deviation, the text (PDF) states (slightly paraphrased, and with my bold):

The end of the input argument area shall be aligned on a 16 (or 32 if __m256 is passed on stack) byte boundary. In other words, the value %rsp + 8 is always a multiple of 16 (or 32) when control is transferred to the function entry point. The stack pointer %rsp always points to the end of the latest allocated stack frame.

There appears to be little leeway in interpreting that in any way that makes the observed behaviour compatible, even though it's known not to cause a problem in this case.

Whether someone considers that important enough to worry about is outside the scope of this answer, I make no judgement on that point :-)

System V ABI - AMD64 - Stack alignment in GCC-emitted assembly

1 Answers