1 vote

I'm trying to implement an application that needs to calculate the dot product of some arrays. This needs to be very fast, so I thought about testing SIMD with NEON. I was able to rewrite my function to use SIMD, but the measured time is nearly the same as before, and sometimes a bit more: roughly 31 seconds without SIMD and 32 seconds with SIMD.

Here is my code with SIMD:

    float output = 0.0f;
    float32x4_t in1_126, in2_126, out_126;

    for (int i = 0; i < NUMBER_OF_INPUTS; i += 4)
    {
        in1_126 = vld1q_f32(&source[i]);        // load 4 floats from each array
        in2_126 = vld1q_f32(&weights[i]);
        out_126 = vmulq_f32(in1_126, in2_126);  // element-wise multiply
        output += vaddvq_f32(out_126);          // horizontal add back into the scalar
    }

    return output;

and here without:

    float output = 0.0f;
    float tmp;

    for (unsigned int i = 0; i < NUMBER_OF_INPUTS; i++)
    {
        tmp = source[i] * weights[i];
        output += tmp;
    }

    return output;

I have set these compiler flags:

-mcpu=cortex-a53 -march=armv8-a+simd+crypto

but it doesn't change anything.

Why is there nearly no difference in timing? Or is using NEON the wrong way to go to make my dot product faster? Do you have any other ideas to make it faster?

Thanks for any reply!

Did you check to see whether the compiler is already vectorizing the scalar code? Also, are you using -O3? – Paul R
I'm not sure how to check this, but it may be the case. I didn't use -O3, but now I tried it and it gives the same result. – J.Ney
Use objdump to view the disassembled output, or godbolt.org to play around interactively. – Paul R
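
As a concrete sketch of that suggestion (the file name dotprod.c is a placeholder for your own source file):

    gcc -O3 -mcpu=cortex-a53 -S dotprod.c -o dotprod.s   # emit assembly to inspect
    gcc -O3 -mcpu=cortex-a53 -c dotprod.c && objdump -d dotprod.o

NEON instructions in the output (e.g. loads and arithmetic on the v registers) indicate the compiler has already vectorized the loop.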

1 Answer

4 votes

You shouldn't move data from a vector register to a scalar one inside the loop.

It will cause a pipeline flush and cost you roughly 14 cycles each time it occurs (on ARMv7-A).

How many cycles it costs exactly depends on the specific architecture.

What you can try:

    float32x4_t in1_126, in2_126;
    float32x4_t out_126 = vmovq_n_f32(0.0f);    // vector accumulator, zero-initialized

    for (int i = 0; i < NUMBER_OF_INPUTS; i += 4)
    {
        in1_126 = vld1q_f32(&source[i]);
        in2_126 = vld1q_f32(&weights[i]);
        out_126 = vmlaq_f32(out_126, in1_126, in2_126);  // multiply-accumulate, stays in vector registers
    }

    output = vaddvq_f32(out_126);               // one horizontal add, after the loop
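
A self-contained version of this approach might look like the following sketch; the function name dot_product, the arm_neon.h include, and the assumption that NUMBER_OF_INPUTS is a multiple of 4 are mine for illustration:

    #include <arm_neon.h>

    /* Sketch: dot product with the accumulation kept in a vector register;
       assumes NUMBER_OF_INPUTS is a multiple of 4 (vaddvq_f32 needs AArch64). */
    float dot_product(const float *source, const float *weights)
    {
        float32x4_t acc = vmovq_n_f32(0.0f);        // vector accumulator

        for (int i = 0; i < NUMBER_OF_INPUTS; i += 4)
        {
            float32x4_t a = vld1q_f32(&source[i]);  // load 4 floats from each array
            float32x4_t b = vld1q_f32(&weights[i]);
            acc = vmlaq_f32(acc, a, b);             // acc += a * b, lane by lane
        }

        return vaddvq_f32(acc);                     // reduce to a scalar once, at the end
    }

If it still isn't faster, unrolling the loop with two or more independent vector accumulators can help hide the latency of vmlaq_f32.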