I'm trying to learn more about ARM assembly and to understand exactly what happens behind the scenes with NEON intrinsics. I'm using the latest Xcode LLVM compiler. I often find that the assembly produced from intrinsics is slower than even plain, naive C code.
For example, this code:
void ArmTest::runTest()
{
    const float vector[4] = {1, 2, 3, 4};
    float result[4];
    float32x4_t vA = vld1q_f32(vector);
    asm("#Begin Test");
    vA = vmulq_f32(vA, vA);
    asm("#End Test");
    vst1q_f32(result, vA);
}
Produces this output:
#Begin Test
ldr q0, [sp, #16]
stp q0, q0, [fp, #-48]
ldur q1, [fp, #-32]
fmul.4s v0, v1, v0
str q0, [sp, #16]
#End Test
What I fail to understand is why there are so many loads and stores hitting memory here. I must be missing something obvious, right? Also, how would one write this in inline assembly so that it is optimal? I would expect just a single instruction between the two markers, but the output is very different.
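To make the inline assembly part of the question concrete, here is a minimal sketch of what I have in mind. It is only a guess at the right approach, not something I have confirmed is optimal. It assumes a 64-bit ARM (AArch64) target and the GCC/Clang extended asm syntax, where the "w" constraint requests a SIMD register. The helper name square4 is made up for illustration, and the scalar fallback is there only so the snippet also builds on other targets.

```cpp
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not part of the original test): squares four floats.
static void square4(const float in[4], float out[4])
{
#if defined(__aarch64__)
    float32x4_t v = vld1q_f32(in);
    // "+w" asks the compiler to keep v in a SIMD register and marks it
    // as both read and written, so the multiply is the only instruction
    // emitted by this asm statement.
    asm("fmul %0.4s, %0.4s, %0.4s" : "+w"(v));
    vst1q_f32(out, v);
#else
    // Assumed fallback for non-AArch64 builds: plain scalar multiply.
    for (int i = 0; i < 4; ++i) {
        out[i] = in[i] * in[i];
    }
#endif
}
```

Calling square4 on {1, 2, 3, 4} should leave {1, 4, 9, 16} in the output array.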
Please help me understand.
Thanks!