
(Updated to remove decltype and replace with static_cast, same results)

In the code example below, adding the cast in the MAX macro makes the code perform faster. I can't figure out why, since the two versions seem like they should be identical. This happens with two different ARM compilers: GCC here, and armclang in the larger codebase. Any thoughts on this would be very helpful.

In the code below, defining WITH_CAST makes the compiled code significantly faster (the same happens in my larger codebase), even though the cast appears to be superfluous. I am running this in Keil 5.25pre2 (simulator only). I used the Keil simulator to check the speed by looking at how many microseconds elapse on the t1 timer.

Snippet from code:

#if defined (WITH_CAST)
#define MAX(a,b) (((a) > (b)) ? (static_cast<mytype>(a)) : (static_cast<mytype>(b)))
#else
#define MAX(a,b) (((a) > (b)) ? ((a)) : ((b)))
#endif

GNU Arm Tools Embedded v. 7 2017-q4-major.

Compiler options: -c -mcpu=cortex-m4 -mthumb -gdwarf-2 -MD -Wall -O -mapcs-frame -mthumb-interwork -std=c++14 -Ofast -I./RTE/_Target_1 -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/CMSIS/Include -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/Device/ARM/ARMCM4/Include -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/lib/gcc/arm-none-eabi/7.2.1/include" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1/arm-none-eabi" -D__UVISION_VERSION="525" -D__GCC -D__GCC_VERSION="721" -D_RTE_ -DARMCM4 -Wa,-alhms="*.lst" -o *.o

Assembler options: -mcpu=cortex-m4 -mthumb --gdwarf-2 -mthumb-interwork --MD .d -I./RTE/_Target_1 -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/CMSIS/Include -IC:/Keil_v525pre/ARM/PACK/ARM/CMSIS/5.2.0/Device/ARM/ARMCM4/Include -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/lib/gcc/arm-none-eabi/7.2.1/include" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1" -I"C:/Program Files (x86)/GNU Tools ARM Embedded/7 2017-q4-major/arm-none-eabi/include/c++/7.2.1/arm-none-eabi" -alhms=".lst" -o *.o

Linker options: -T ./RTE/Device/ARMCM4/gcc_arm.ld -mcpu=cortex-m4 -mthumb -mthumb-interwork -Wl,-Map="./Optimization.map" -o Optimization.elf *.o -lm

#include <cstdlib>
#include <cstring>
#include <cstdint>

#define WITH_CAST
struct mytype {
 uint32_t value;
 __attribute__((const, always_inline)) constexpr friend bool operator>(const mytype & t, const mytype & a) {
  return t.value > a.value;
 }
};
static mytype output_buf [32];
static mytype * output_memory_ptr = output_buf;
static mytype * volatile * output_memory_tmpp = &output_memory_ptr;
static mytype input_buf [32];
static mytype * input_memory_ptr = input_buf;
static mytype * volatile * input_memory_tmpp = &input_memory_ptr;
#if defined (WITH_CAST)
#define MAX(a,b) (((a) > (b)) ? (static_cast<mytype>(a)) : (static_cast<mytype>(b)))
#else
#define MAX(a,b) (((a) > (b)) ? ((a)) : ((b)))
#endif
int main (void) {
 const mytype * input = *input_memory_tmpp;
 mytype * output = *output_memory_tmpp;
 mytype p = input[0];
 mytype c = input[1];
 mytype pc = MAX(p, c);
 output[0] = pc;
 for (int i = 1; i < 31; i ++) {
  mytype n = input[i + 1];
  mytype cn = MAX(c, n);
  output[i] = MAX(pc, cn);
  p = c;
  c = n;
  pc = cn;
 }
 output[31] = pc;
}
The cast is not superfluous: it is a cast to a non-reference type, which means the result of ?: is no longer an lvalue. Without that cast the result of ?: is an lvalue. Now, why that makes your code faster is a different story... - AnT
That leads somewhere. Changing static_cast<mytype> to static_cast<mytype&> produces the same suboptimal results, and of course assigning to the result of MAX only works with mytype& because of the lvalue behaviour you've pointed out. I still can't figure out why the compiler can't see how the variable is being used and optimize accordingly. - nachum
@AnT Add your comment as an answer and I'll accept it. It explained what was happening - of course not why, but that seems like a question for the compiler writers. - nachum
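
For what it's worth, here is a minimal, self-contained sketch (not taken from the question's code) of the value-category difference AnT describes: without the cast the conditional expression is an lvalue referring to one of the operands, while with the cast it is a prvalue, i.e. a fresh temporary.

#include <cstdint>
#include <type_traits>

struct mytype { uint32_t value; };

constexpr bool operator>(const mytype& t, const mytype& a) {
    return t.value > a.value;
}

#define MAX_NOCAST(a,b) (((a) > (b)) ? ((a)) : ((b)))
#define MAX_CAST(a,b)   (((a) > (b)) ? (static_cast<mytype>(a)) : (static_cast<mytype>(b)))

int main() {
    mytype p{1}, c{2};

    // Without the cast the result refers to p or c, so its decltype is mytype&
    // and the expression can even be assigned to.
    static_assert(std::is_same<decltype(MAX_NOCAST(p, c)), mytype&>::value, "lvalue");
    MAX_NOCAST(p, c) = mytype{42};   // compiles: writes through to c here

    // With the cast the result is a fresh temporary, so its decltype is mytype
    // (a prvalue); an assignment would only modify that temporary.
    static_assert(std::is_same<decltype(MAX_CAST(p, c)), mytype>::value, "prvalue");

    return 0;
}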

1 Answer


Quotation from the C++0x specification:

The type denoted by decltype(e) is defined as follows:

— if e is an unparenthesized id-expression or a class member access (5.2.5), decltype(e) is the type of the entity named by e. If there is no such entity, or if e names a set of overloaded functions, the program is ill-formed;

— otherwise, if e is a function call (5.2.2) or an invocation of an overloaded operator (parentheses around e are ignored), decltype(e) is the return type of the statically chosen function;

— otherwise, if e is an lvalue, decltype(e) is T&, where T is the type of e;

— otherwise, decltype(e) is the type of e.
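
A short illustration of those rules (a sketch added for clarity, not taken from the question's code):

#include <type_traits>

int x = 0;
// unparenthesized id-expression: decltype is the declared type of x
static_assert(std::is_same<decltype(x), int>::value, "int");
// (x) is a parenthesized lvalue expression: decltype is T&, i.e. int&
static_assert(std::is_same<decltype((x)), int&>::value, "int&");
// a prvalue expression such as x + 1: decltype is just T, i.e. int
static_assert(std::is_same<decltype(x + 1), int>::value, "int");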

I guess the use of a reference (T&) makes it more efficient.

From the discussion in "Want Speed? Don't (Always) Pass by Value":

Involving only lvalues, in the absence of move semantics, the “pass by value” version results in one extra object being created, via a copy construction.

Therefore, the use of `decltype`, i.e. passing by reference, improved the efficiency of your code.
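
To make the "one extra object" point concrete, here is a small copy-counting sketch (assumed for illustration, not part of the question's code). The by-value version materializes a temporary through the copy constructor on every call, while the by-reference version copies only when the caller actually stores the result.

#include <cstdio>

struct counted {
    static int copies;
    int value;
    counted(int v) : value(v) {}
    counted(const counted& other) : value(other.value) { ++copies; }
};
int counted::copies = 0;

// By-value ternary: each evaluation creates a temporary via copy construction.
counted max_by_value(const counted& a, const counted& b) {
    return (a.value > b.value) ? static_cast<counted>(a) : static_cast<counted>(b);
}

// By-reference ternary: no copy is made until the caller decides to copy.
const counted& max_by_ref(const counted& a, const counted& b) {
    return (a.value > b.value) ? a : b;
}

int main() {
    counted a(1), b(2);

    counted r1 = max_by_value(a, b);   // copy into the temporary, plus a (possibly elided) copy into r1
    std::printf("by value: %d copies\n", counted::copies);

    counted::copies = 0;
    counted r2 = max_by_ref(a, b);     // exactly one copy, into r2
    std::printf("by ref:   %d copies\n", counted::copies);

    return r1.value + r2.value - 4;    // keep the results observable
}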