A trivial function I'm compiling with gcc and clang:
#include <stdio.h>

void test() {
    printf("hm");
    printf("hum");
}
$ gcc test.c -fomit-frame-pointer -masm=intel -O3 -S
sub rsp, 8
.cfi_def_cfa_offset 16
mov esi, OFFSET FLAT:.LC0
mov edi, 1
xor eax, eax
call __printf_chk
mov esi, OFFSET FLAT:.LC1
mov edi, 1
xor eax, eax
add rsp, 8
.cfi_def_cfa_offset 8
jmp __printf_chk
And
$ clang test.c -mllvm --x86-asm-syntax=intel -fomit-frame-pointer -O3 -S
# BB#0:
push rax
.Ltmp1:
.cfi_def_cfa_offset 16
mov edi, .L.str
xor eax, eax
call printf
mov edi, .L.str1
xor eax, eax
pop rdx
jmp printf # TAILCALL
The difference I'm interested in is that gcc uses sub rsp, 8 / add rsp, 8 for the function prologue and epilogue, while clang uses push rax / pop rdx.
Why do the compilers use different function prologues? Which variant is better? push and pop certainly encode to shorter instructions, but are they faster or slower than add and sub?
The reason for the stack fiddling in the first place seems to be that the ABI requires rsp to be 16-byte aligned at every call made from a non-leaf procedure. I haven't been able to find any compiler flags that remove these instructions.
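To make the alignment argument concrete, here is a minimal sketch of the arithmetic in C. It assumes the x86-64 System V convention that rsp is a multiple of 16 immediately before a call, and that the call itself pushes an 8-byte return address. The function names and the stack address used with them are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: rsp as seen on entry to the callee.
   Assumes the caller kept rsp 16-byte aligned at the call site;
   the call instruction then pushes an 8-byte return address. */
uint64_t rsp_at_entry(uint64_t rsp_before_call) {
    assert(rsp_before_call % 16 == 0);
    return rsp_before_call - 8;
}

/* Effect of either "sub rsp, 8" or "push rax": rsp drops by 8. */
uint64_t rsp_after_adjust(uint64_t rsp_entry) {
    return rsp_entry - 8;
}

/* Nonzero when rsp satisfies the 16-byte requirement for a call. */
int is_call_aligned(uint64_t rsp) {
    return rsp % 16 == 0;
}
```

With an assumed caller-side value such as 0x7fffffffe000, rsp_at_entry gives an address that is 8 past a 16-byte boundary, so test() could not call printf directly. One 8-byte adjustment, by either sub or push, restores the alignment.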
Judging from your answers, it seems like push & pop is better. push rax + pop rdx = 1 + 1 = 2 bytes vs. sub rsp, 8 + add rsp, 8 = 4 + 4 = 8 bytes. So the former pair saves 6 bytes at no expense.
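The byte counts can be checked against the machine-code encodings. The sketch below lists the standard x86-64 encodings of the four instructions as byte arrays and adds up their sizes. The array and function names are hypothetical, and the code only measures code size; it says nothing about execution speed.

```c
#include <assert.h>
#include <stddef.h>

/* x86-64 encodings of the four instructions under discussion. */
static const unsigned char PUSH_RAX[]  = { 0x50 };                   /* push rax   */
static const unsigned char POP_RDX[]   = { 0x5a };                   /* pop rdx    */
static const unsigned char SUB_RSP_8[] = { 0x48, 0x83, 0xec, 0x08 }; /* sub rsp, 8 */
static const unsigned char ADD_RSP_8[] = { 0x48, 0x83, 0xc4, 0x08 }; /* add rsp, 8 */

/* Total size in bytes of clang's push/pop pair. */
size_t push_pop_size(void) {
    return sizeof PUSH_RAX + sizeof POP_RDX;
}

/* Total size in bytes of gcc's sub/add pair. */
size_t sub_add_size(void) {
    return sizeof SUB_RSP_8 + sizeof ADD_RSP_8;
}

/* Bytes saved by choosing push/pop over sub/add. */
size_t bytes_saved(void) {
    assert(sub_add_size() >= push_pop_size());
    return sub_add_size() - push_pop_size();
}
```

Calling bytes_saved() reproduces the 6-byte difference computed above: 8 bytes for the sub/add pair against 2 bytes for push/pop.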
(%rsp + 8) is 16B aligned. (I edited most of this comment into my answer.) – Peter Cordes