Empty __asm__ statements are not enough: better to use data dependencies.
Like this:
main.c

int main(void) {
    unsigned i;
    for (i = 0; i < 10; i++) {
        __asm__ volatile("" : "+g" (i) : :);
    }
}
Compile and disassemble:
gcc -O3 -ggdb3 -o main.out main.c
gdb -batch -ex 'disas main' main.out
Output:
0x0000000000001040 <+0>: xor %eax,%eax
0x0000000000001042 <+2>: nopw 0x0(%rax,%rax,1)
0x0000000000001048 <+8>: add $0x1,%eax
0x000000000000104b <+11>: cmp $0x9,%eax
0x000000000000104e <+14>: jbe 0x1048 <main+8>
0x0000000000001050 <+16>: xor %eax,%eax
0x0000000000001052 <+18>: retq
I believe that this is robust, because it places an explicit data dependency on the loop variable i, as suggested at: Enforcing statement order in C++, and produces the desired loop.
This marks i as both an input and an output of the inline assembly. The inline assembly is then a black box to GCC, which cannot know how it modifies i, so the loop really can't be optimized away.
If I do the same with an empty __asm__ as in:
bad.c

int main(void) {
    unsigned i;
    for (i = 0; i < 10; i++) {
        __asm__ volatile("");
    }
}
GCC completely removes the loop and outputs:
0x0000000000001040 <+0>: xor %eax,%eax
0x0000000000001042 <+2>: retq
Also note that __asm__("") and __asm__ volatile("") should be the same, since there are no output operands: The difference between asm, asm volatile and clobbering memory
What is happening becomes clearer if we replace it with:
__asm__ volatile("nop");
which produces:
0x0000000000001040 <+0>: nop
0x0000000000001041 <+1>: nop
0x0000000000001042 <+2>: nop
0x0000000000001043 <+3>: nop
0x0000000000001044 <+4>: nop
0x0000000000001045 <+5>: nop
0x0000000000001046 <+6>: nop
0x0000000000001047 <+7>: nop
0x0000000000001048 <+8>: nop
0x0000000000001049 <+9>: nop
0x000000000000104a <+10>: xor %eax,%eax
0x000000000000104c <+12>: retq
So we see that GCC simply unrolled the nop loop in this case, because the loop was small enough.
So, if you rely on an empty __asm__, you are relying on hard-to-predict GCC binary size/speed tradeoffs, which, if applied optimally, would always remove the loop for an empty __asm__ volatile(""), since it has code size zero.
noinline busy loop function
If the loop size is not known at compile time, full unrolling is not possible, but GCC could still decide to unroll in chunks, which would make your delays inconsistent.
Putting that together with Denilson's answer, a busy loop function could be written as:
void __attribute__ ((noinline)) busy_loop(unsigned max) {
    for (unsigned i = 0; i < max; i++) {
        __asm__ volatile("" : "+g" (i) : :);
    }
}

int main(void) {
    busy_loop(10);
}
which disassembles to:
Dump of assembler code for function busy_loop:
0x0000000000001140 <+0>: test %edi,%edi
0x0000000000001142 <+2>: je 0x1157 <busy_loop+23>
0x0000000000001144 <+4>: xor %eax,%eax
0x0000000000001146 <+6>: nopw %cs:0x0(%rax,%rax,1)
0x0000000000001150 <+16>: add $0x1,%eax
0x0000000000001153 <+19>: cmp %eax,%edi
0x0000000000001155 <+21>: ja 0x1150 <busy_loop+16>
0x0000000000001157 <+23>: retq
End of assembler dump.
Dump of assembler code for function main:
0x0000000000001040 <+0>: mov $0xa,%edi
0x0000000000001045 <+5>: callq 0x1140 <busy_loop>
0x000000000000104a <+10>: xor %eax,%eax
0x000000000000104c <+12>: retq
End of assembler dump.
Here, the volatile is needed to mark the assembly as potentially having side effects, since in this case we have output operands.
A double loop version could be:
void __attribute__ ((noinline)) busy_loop(unsigned max, unsigned max2) {
    for (unsigned i = 0; i < max2; i++) {
        for (unsigned j = 0; j < max; j++) {
            __asm__ volatile ("" : "+g" (i), "+g" (j) : :);
        }
    }
}

int main(void) {
    busy_loop(10, 10);
}
GitHub upstream.
Tested in Ubuntu 19.04, GCC 8.3.0.
Comments:
– user405725: "… volatile asm ("rep; nop;") to busy-pause by wasting CPU cycles that do nothing?"
– Denilson Sá Maia: avr-libc has no sleep function that just waits some time. Instead, it maps to the CPU sleep instruction, which starts one of the low-power modes (effectively stopping the CPU). Good idea, nevertheless.