Inline NOPs not optimized out in LLVM

Question

I'm working through an example in this overview of compiling inline ARM assembly using GCC. Rather than GCC, I'm using llvm-gcc 4.2.1, and I'm compiling the following C code:

#include <stdio.h>
int main(void) {
    printf("Volatile NOP\n");
    asm volatile("mov r0, r0");
    printf("Non-volatile NOP\n");
    asm("mov r0, r0");
    return 0;
}

Using the following commands:

llvm-gcc -emit-llvm -c -o compiled.bc input.c
llc -O3 -march=arm -o output.s compiled.bc

My output.s ARM ASM file looks like this:

    .syntax unified
    .eabi_attribute 20, 1
    .eabi_attribute 21, 1
    .eabi_attribute 23, 3
    .eabi_attribute 24, 1
    .eabi_attribute 25, 1
    .file   "compiled.bc"
    .text
    .globl  main
    .align  2
    .type   main,%function
main:                                   @ @main
@ BB#0:                                 @ %entry
    str lr, [sp, #-4]!
    sub sp, sp, #16
    str r0, [sp, #12]
    ldr r0, .LCPI0_0
    str r1, [sp, #8]
    bl  puts
    @APP
    mov r0, r0
    @NO_APP
    ldr r0, .LCPI0_1
    bl  puts
    @APP
    mov r0, r0
    @NO_APP
    mov r0, #0
    str r0, [sp, #4]
    str r0, [sp]
    ldr r0, [sp, #4]
    add sp, sp, #16
    ldr lr, [sp], #4
    bx  lr
@ BB#1:
    .align  2
.LCPI0_0:
    .long   .L.str

    .align  2
.LCPI0_1:
    .long   .L.str1

.Ltmp0:
    .size   main, .Ltmp0-main

    .type   .L.str,%object          @ @.str
    .section    .rodata.str1.1,"aMS",%progbits,1
.L.str:
    .asciz   "Volatile NOP"
    .size   .L.str, 13

    .type   .L.str1,%object         @ @.str1
    .section    .rodata.str1.16,"aMS",%progbits,1
    .align  4
.L.str1:
    .asciz   "Non-volatile NOP"
    .size   .L.str1, 17

The two NOPs are between their respective @APP/@NO_APP pairs. My expectation is that the asm() statement without the volatile keyword will be optimized out of existence due to the -O3 flag, but clearly both inline assembly statements survive.

Why does the asm("mov r0, r0") line not get recognized and removed as a NOP?

When you write inline assembly, you get what you write. The compiler makes no attempt to optimize the inline assembly that you write. — Mysticial
@Mystical According to the linked article: "When adding assembly language code by using inline assembler statements, this code is also processed by the C compiler's code optimizer." Is the author incorrect, and if so, can you link me to documentation about how LLVM handles this? — Zeke
That just means that it can optimize the entire inline assembly statement. But it will not go inside it and try to mess with the instructions themselves. The optimizer will treat the entire inline assembly statement as a black-box. It can remove or duplicate it if no side-effects are specified. But since you specified volatile, it will assume that it has side-effects and will not even touch it. — Mysticial
If this is good enough, I can make it an answer. But my only experience with inline assembly is in GCC and ICC. So I'm not sure if it's any different with the LLVM optimizers. — Mysticial
volatile is not about not removing or removing asm statement. Volatile with asm block instructs compiler to not reorder asm statement with its neighbor statements. — Mārtiņš Možeiko

artless noise · Accepted Answer · 2013-02-28T01:55:02

As Mysticial and Mārtiņš Možeiko have described, the compiler does not optimize the code itself; i.e., it does not change the instructions. What the compiler does optimize is when the instruction is scheduled. When you use volatile, the compiler will not reschedule it. In your example, rescheduling would mean moving the instruction before or after the printf.

The other optimization the compiler might make is to get C values into registers for you. Register allocation is very important to optimization. This doesn't optimize the assembler, but it allows the compiler to do sensible things with the other code within the function.
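As a rough illustration of the register point, here is a minimal sketch. The helper names add_seven and check_add_seven are made up for this example and are not from the question or the tutorial. The operand constraints are the only part of the statement the compiler reasons about: it picks the registers substituted for %0 and %1 and moves x into place. The plain C fallback is there only so the sketch also builds on a non-ARM host, and __asm__ is used instead of asm so it compiles in strict ISO C modes as well.

```c
#include <assert.h>

/* Hypothetical helper, not from the original post: returns x + 7. */
static int add_seven(int x)
{
    int result;
#if defined(__arm__) && !defined(__thumb__)
    /* The compiler chooses the registers for %0 and %1 and gets x into
       one of them; it never inspects the "add" instruction itself. */
    __asm__("add %0, %1, #7" : "=r"(result) : "r"(x));
#else
    /* Fallback so the sketch also builds on non-ARM hosts. */
    result = x + 7;
#endif
    return result;
}

/* Small self-check for the helper above. */
static void check_add_seven(void)
{
    assert(add_seven(5) == 12);
    assert(add_seven(-7) == 0);
}
```

Because the input is only described as "some register", the compiler is free to keep x wherever the surrounding code already has it, which is what lets the rest of the function stay well optimized.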

To see the effect of volatile, here is some sample code:

int example(int test, int add)
{
  int v1=5, v2=0;
  int i=0;
  if(test) {
    asm volatile("add %0, %1, #7" : "=r" (v2) : "r" (v2));
    i+= add * v1;
    i+= v2;
  } else {
    asm ("add %0, %1, #7" : "=r" (v2) : "r" (v2));
    i+= add * v1;
    i+= v2;
  }
  return i;
}

The two branches have identical code except for the volatile. gcc 4.7.2 generates the following code for an ARM926:

example:
   cmp  r0, #0
   bne  1f           /* branch if test set? */
   add  r1, r1, r1, lsl #2
   add  r0, r0, #7   /* add seven delayed */
   add  r0, r0, r1
   bx   lr
1: mov  r0, #0       /* test set */
   add  r0, r0, #7   /* add seven immediate */
   add  r1, r1, r1, lsl #2
   add  r0, r0, r1
   bx   lr

Note: The assembler branches are reversed relative to the 'C' code. The 2nd branch is slower on some processors due to pipelining. The compiler prefers that

   add  r1, r1, r1, lsl #2
   add  r0, r0, r1

do not execute sequentially.

The Ethernut ARM Tutorial is an excellent resource. However, optimize is a bit of an overloaded word. The compiler doesn't analyze the assembler, only the arguments and where the code will be emitted.
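To make the "arguments only" point concrete, here is a small sketch with made-up function names (opaque, removable_vs_kept, check_asm_sketch). It uses empty templates so it builds on any target and emits no instructions. The GCC manual states that an asm statement may be discarded when its outputs are not needed, and that an asm statement with no output operands is implicitly volatile. Assuming llvm-gcc follows the same rule, which I have not verified, that would also explain why the plain asm("mov r0, r0") in the question survives -O3.

```c
#include <assert.h>

/* Hypothetical helper: returns x unchanged, via an opaque asm statement. */
static int opaque(int x)
{
    /* Empty template, so no instruction is emitted. "+r" says the asm
       reads and writes x in a register, so the compiler must assume the
       value may have changed and cannot constant-fold through it. */
    __asm__("" : "+r"(x));
    return x;
}

/* Hypothetical helper: three asm forms the optimizer treats differently. */
static int removable_vs_kept(int x)
{
    int unused;

    /* Output is never read and there is no volatile: the optimizer is
       allowed to delete the whole statement. */
    __asm__("" : "=r"(unused) : "r"(x));

    /* Same operands with volatile: the statement must be kept. */
    __asm__ volatile("" : "=r"(unused) : "r"(x));

    /* No output operands: documented by GCC as implicitly volatile. */
    __asm__("");

    return opaque(x);
}

/* Small self-check: none of the asm statements alter the result. */
static void check_asm_sketch(void)
{
    assert(opaque(42) == 42);
    assert(removable_vs_kept(3) == 3);
}
```

In none of these cases does the compiler look at the template string; the decision to keep or drop a statement comes only from the operands and the volatile qualifier.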
