I am trying to learn more about assembly and which optimizations compilers can and cannot do.
I have a test piece of code for which I have some questions.
See it in action here: https://godbolt.org/z/pRztTT, or check the code and assembly below.
#include <stdio.h>
#include <string.h>

int main(int argc, char* argv[])
{
    for (int j = 0; j < 100; j++) {
        if (argc == 2 && argv[1][0] == '5') {
            printf("yes\n");
        }
        else {
            printf("no\n");
        }
    }
    return 0;
}
The assembly produced by GCC 10.1 with -O3:
.LC0:
        .string "no"
.LC1:
        .string "yes"
main:
        push    rbp
        mov     rbp, rsi
        push    rbx
        mov     ebx, 100
        sub     rsp, 8
        cmp     edi, 2
        je      .L2
        jmp     .L3
.L5:
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        sub     ebx, 1
        je      .L4
.L2:
        mov     rax, QWORD PTR [rbp+8]
        cmp     BYTE PTR [rax], 53
        jne     .L5
        mov     edi, OFFSET FLAT:.LC1
        call    puts
        sub     ebx, 1
        jne     .L2
.L4:
        add     rsp, 8
        xor     eax, eax
        pop     rbx
        pop     rbp
        ret
.L3:
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        sub     ebx, 1
        je      .L4
        mov     edi, OFFSET FLAT:.LC0
        call    puts
        sub     ebx, 1
        jne     .L3
        jmp     .L4
It seems like GCC produces two versions of the loop: one with the argv[1][0] == '5' condition but without the argc == 2 condition, and one without any condition.
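As a rough C-level picture of the two loop versions described above, here is a hand-written sketch. It is only an approximation of the assembly's shape, not something GCC emitted: the function name run_partially_unswitched and the yes_count return value are hypothetical and exist only so the behavior can be checked, and the 2x duplication of the body in .L3 is left out.

```c
#include <stdio.h>

/* Hypothetical helper (not in the original code): same behavior as the
   question's main(), restructured by hand to mirror the shape of the
   GCC output. Returns how many times "yes" was printed. */
int run_partially_unswitched(int argc, char *argv[])
{
    int yes_count = 0;

    if (argc == 2) {
        /* Roughly .L2/.L5: argc was tested once before the loop, but
           argv[1][0] is still loaded and compared on every iteration. */
        for (int j = 0; j < 100; j++) {
            if (argv[1][0] == '5') {
                puts("yes");
                yes_count++;
            } else {
                puts("no");
            }
        }
    } else {
        /* Roughly .L3: no condition is left inside the loop. */
        for (int j = 0; j < 100; j++) {
            puts("no");
        }
    }
    return yes_count;
}
```

Calling it as run_partially_unswitched(argc, argv) from main() should print the same 100 lines as the original program.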
My questions:
- What is preventing GCC from splitting away the full condition? It is similar to this question, but there is no chance for the code to get a pointer into argv here.
- In the loop without any condition (.L3 in the assembly), why is the loop body duplicated? Is it to reduce the number of jumps while still fitting in some sort of cache?
[...] printf won't modify memory pointed to by argv. It would need special rules for main and printf/puts to know that that char ** arg won't ever point, directly or indirectly, to memory that some non-inline function call named puts might modify. Re: unrolling: that's odd, -funroll-loops isn't on by default for GCC at -O3, only with -O3 -fprofile-use. – Peter Cordes
[...] argv[1][0] into a local char variable first, GCC does move the full condition outside the loop. Would (theoretically) compiling puts() together with this main() allow the compiler to see puts() isn't touching argv and optimize the loop fully? – Tomas Creemers
[...] write function that uses an inline asm statement around a syscall instruction, with a memory input operand (and no "memory" clobber) then it could inline. (Or maybe do inter-procedural optimization without inlining.) – Peter Cordes
[...] -freorder-blocks-algorithm=stc: 'stc', the "software trace cache" algorithm, which tries to put all often executed code together, minimizing the number of branches executed by making extra copies of code. – Tomas Creemers
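For reference, the local-variable variant mentioned in the comments might look something like the sketch below. The exact code that was tried is not shown in the thread, so this is a guess at its shape: the name run_with_local_copy, the '\0' placeholder, and the returned count are all hypothetical. The argc == 2 guard on the read is an assumption added here, since argv[1] is a null pointer when argc is 1.

```c
#include <stdio.h>

/* Hypothetical helper (assumed shape of the variant from the comments):
   argv[1][0] is read into a local before the loop, so no call to puts()
   sits between the load and its use. Returns how many times "yes" was
   printed. */
int run_with_local_copy(int argc, char *argv[])
{
    /* Only dereference argv[1] when it exists; '\0' is an arbitrary
       placeholder for the "no argument" case. */
    const char first = (argc == 2) ? argv[1][0] : '\0';
    int yes_count = 0;

    for (int j = 0; j < 100; j++) {
        /* The condition now depends only on locals whose address is
           never taken, so it cannot change between iterations. */
        if (argc == 2 && first == '5') {
            puts("yes");
            yes_count++;
        } else {
            puts("no");
        }
    }
    return yes_count;
}
```

Because the local's address never escapes, a call to puts() cannot legally change it, which is presumably what lets GCC hoist the whole test. Whether a given GCC version does so for this exact sketch would need checking on Compiler Explorer.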