Why do LLVM and GCC use different function prologues on x86-64?

Question

A trivial function I'm compiling with gcc and clang:

#include <stdio.h>

void test() {
    printf("hm");
    printf("hum");
}

$ gcc test.c -fomit-frame-pointer -masm=intel -O3 -S

sub rsp, 8
.cfi_def_cfa_offset 16
mov esi, OFFSET FLAT:.LC0
mov edi, 1
xor eax, eax
call    __printf_chk
mov esi, OFFSET FLAT:.LC1
mov edi, 1
xor eax, eax
add rsp, 8
.cfi_def_cfa_offset 8
jmp __printf_chk

And

$ clang test.c -mllvm --x86-asm-syntax=intel -fomit-frame-pointer -O3 -S    

# BB#0:
push    rax
.Ltmp1:
.cfi_def_cfa_offset 16
mov edi, .L.str
xor eax, eax
call    printf
mov edi, .L.str1
xor eax, eax
pop rdx
jmp printf                  # TAILCALL

The difference I'm interested in is that gcc uses sub rsp, 8/add rsp, 8 for the function prolog and clang uses push rax/pop rdx.

Why do the compilers use different function prologues? Which variant is better? push and pop certainly encode to shorter instructions, but are they faster or slower than add and sub?

The reason for the stack fiddling in the first place seems to be that the ABI requires rsp to be 16-byte aligned for non-leaf procedures. I haven't been able to find any compiler flags that remove these instructions.

Edit: Judging from your answers, it seems like push & pop is better. push rax + pop rdx = 1 + 1 = 2 bytes vs. sub rsp, 8 + add rsp, 8 = 4 + 4 = 8 bytes. So the former pair saves 6 bytes at no expense.

It's a matter of choice. It's hard to tell which variant is better. Probably both variants are rather similar in terms of performance. — Jabberwocky
re: your edit. Yes, the ABI guarantees that at function entry, (%rsp + 8) is 16B aligned. (I edited most of this comment into my answer). — Peter Cordes

Peter Cordes · Accepted Answer · 2015-07-21T11:29:45

On Intel, sub / add will trigger the stack engine to insert an extra uop to synchronize %rsp for the out-of-order execution part of the pipeline. (See Agner Fog's microarch doc, specifically pg 91, about the stack engine. AFAIK, it still works the same on Haswell as on Pentium M, as far as when it needs to insert extra uops.)

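The push / pop will take fewer fused-domain uops, and so will probably be more efficient even though they use the store/load ports. They sit between the call and the tail-call jmp / ret, which the stack engine also handles, so nothing in between reads %rsp explicitly and no sync uop is needed.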

So, push / pop is at least not going to be slower, and it takes fewer instruction bytes. Better I-cache density is good.
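The byte counts from the question can be checked against the actual 64-bit encodings. This is only a sketch: the array and function names are made up for illustration, and the bytes are the standard opcodes for these four instructions (REX.W 83 /5 ib for sub, REX.W 83 /0 ib for add, 50+r for push, 58+r for pop).

```c
#include <stddef.h>

/* Machine-code bytes in 64-bit mode. Array names are hypothetical. */
const unsigned char push_rax[]  = { 0x50 };                    /* push rax   */
const unsigned char pop_rdx[]   = { 0x5A };                    /* pop  rdx   */
const unsigned char sub_rsp_8[] = { 0x48, 0x83, 0xEC, 0x08 };  /* sub rsp, 8 */
const unsigned char add_rsp_8[] = { 0x48, 0x83, 0xC4, 0x08 };  /* add rsp, 8 */

/* Code size of clang's push/pop pair. */
size_t push_pop_bytes(void) { return sizeof push_rax + sizeof pop_rdx; }

/* Code size of gcc's sub/add pair. */
size_t sub_add_bytes(void) { return sizeof sub_rsp_8 + sizeof add_rsp_8; }

/* Bytes saved per function by using push/pop instead of sub/add. */
size_t bytes_saved(void) { return sub_add_bytes() - push_pop_bytes(); }
```

Calling bytes_saved() from a main gives the 6 bytes mentioned in the question. This only measures code size; it says nothing about uop counts.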

BTW, I think the point of the pair of insns is to keep the stack 16B-aligned after call pushes the 8B return address. This is one case where the ABI ends up requiring semi-useless instructions. More complex functions that need some stack space to spill locals, and then reload them after function calls, will typically sub $something, %rsp to reserve space.

The System V (Linux) amd64 ABI guarantees that at function entry, (%rsp + 8) is 16B-aligned; that is where any args passed on the stack will be. (http://x86-64.org/documentation/abi.pdf). You have to arrange for that to be the case for any function you call, or it's your fault if it segfaults from using an SSE aligned load, or otherwise crashes because it assumed it could use AND to mask an address or something.
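The alignment arithmetic can be sketched in C by modelling %rsp as a plain integer. The helper names here are hypothetical, and this simulates the pointer values only; it does not read the real stack pointer.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when addr is a multiple of 16. */
bool is_16b_aligned(uint64_t addr) { return (addr & 15u) == 0; }

/* call pushes an 8-byte return address, so rsp drops by 8. */
uint64_t rsp_after_call(uint64_t rsp_before_call) { return rsp_before_call - 8; }

/* push rax and sub rsp, 8 both lower rsp by another 8 bytes. */
uint64_t rsp_after_pad(uint64_t rsp_at_entry) { return rsp_at_entry - 8; }
```

If the caller had a 16B-aligned rsp right before its call, then rsp at function entry is off by 8 (so entry rsp + 8 is aligned), and the extra 8-byte adjustment makes rsp 16B-aligned again before the next call.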
