0
votes

I'm trying to learn more about ARM assembly and understand what exactly is happening behind the scenes with NEON intrinsics. I'm using the latest Xcode LLVM compiler. I find that often, the assembly produced from intrinsics is actually slower than even plain naive C code.

For example this code:

void ArmTest::runTest()
{

    const float vector[4] = {1,2,3,4};
    float result[4];
    float32x4_t vA = vld1q_f32(vector);

    asm("#Begin Test");

    vA = vmulq_f32(vA, vA);

    asm("#End Test");

    vst1q_f32(result, vA);
}

Produces this output:

#Begin Test

ldr q0, [sp, #16]
stp q0, q0, [fp, #-48]
ldur    q1, [fp, #-32]
fmul.4s v0, v1, v0
str q0, [sp, #16]

#End Test

What I fail to understand is why all the loads/stores hitting the memory here? I must be missing something obvious, right? Also, how would one write this in inline assembly so that it is optimal? I would expect just a single instruction, but the output is way different.

Please help me understand.

Thanks!

1
You should never trust compilers. Looking at your example, especially 64-bit codes are far far from halfway decent. Stop wasting time with intrinsics and checking the generated codes, but learn assembly instead. You won't get very far with intrinsics without knowing the actual instructions anyway. - Jake 'Alquimista' LEE
You should be also aware that accessing the same memory area with ARM and NEON wastes many many cycles while "switching". Therefore, if you write test functions, you should let them handle large chunk of data through multiple iterations. In your example, vector is probably initialized by ARM on the stack, and read and computed by NEON - very slow indeed. - Jake 'Alquimista' LEE
void square(float * pDst, float * pSrc, unsigned int size); <== for example. - Jake 'Alquimista' LEE

1 Answers

2
votes

You need a better test. Your test doesn't use the results of your calculation for anything so the compiler is just going through the motions to make you happy. It looks like you're compiling with -O0 which will produce a bunch of unnecessary loads and stores for debugging purposes. If you compiled with -O3, all of your code would be stripped out. I rewrote your test to preserve the results and compiled with -O3 and here are the results:

$ cat neon.c 
#include <arm_neon.h>

void runTest(const float vector[], float result[])
{
    float32x4_t vA = vld1q_f32(vector);
     vA = vmulq_f32(vA, vA);
     vst1q_f32(result, vA);
}

$ xcrun -sdk iphoneos clang -arch arm64 -S neon.c -O3
$ cat neon.s 
    .section    __TEXT,__text,regular,pure_instructions
    .globl  _runTest
    .align  2
_runTest:                               ; @runTest
; BB#0:
    ldr q0, [x0]
    fmul.4s v0, v0, v0
    str q0, [x1]
    ret lr

This code looks optimal