I'm trying to implement an application that needs to calculate the dot product of some arrays. This needs to be very fast, so I thought about testing SIMD with NEON. I was able to rewrite my function to use SIMD, but the measured time is nearly the same as before, and sometimes a bit more: about 31 seconds without SIMD and 32 seconds with SIMD.
Here is my code with SIMD:
    float output = 0.0;
    for (int i = 0; i < NUMBER_OF_INPUTS; i += 4)
    {
        in1_126 = vld1q_f32(&source[i]);
        in2_126 = vld1q_f32(&weights[i]);
        out_126 = vmulq_f32(in1_126, in2_126);
        output += vaddvq_f32(out_126);
    }
    return output;
And here it is without SIMD:
    float output = 0.0;
    float tmp;
    for (unsigned int i = 0; i < NUMBER_OF_INPUTS; i++)
    {
        tmp = source[i] * weights[i];
        output += tmp;
    }
    return output;
I have set these compiler flags:
    -mcpu=cortex-a53 -march=armv8-a+simd+crypto
but it doesn't change anything.
Why is there nearly no difference in timing? Or is NEON the wrong way to make my dot product faster? Do you have any other ideas to make it faster?
Thanks for any reply!
-O3? – Paul R
objdump to view the disassembled output, or godbolt.org to play around interactively. – Paul R
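One direction that might be worth trying, sketched here under assumptions rather than measured: the SIMD loop in the question does a horizontal add (vaddvq_f32) on every iteration, so each step still funnels into a single scalar accumulator. A common alternative is to keep the running sum in vector registers and reduce only once after the loop. The function name dot_product, the size_t length parameter, the use of two accumulators, and the scalar fallback for non-AArch64 builds are all illustrative choices, not part of the original code. Whether this is actually faster on a Cortex-A53 would need to be checked with optimization enabled and by looking at the generated assembly.

```c
#include <stddef.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name and signature are assumptions, not from the post). */
float dot_product(const float *source, const float *weights, size_t n)
{
    size_t i = 0;
    float output = 0.0f;

#if defined(__ARM_NEON) && defined(__aarch64__)
    /* Two independent vector accumulators, so consecutive multiply-adds
       do not all depend on the same register. */
    float32x4_t acc0 = vdupq_n_f32(0.0f);
    float32x4_t acc1 = vdupq_n_f32(0.0f);

    /* Process 8 floats per iteration: acc += source * weights, lane by lane. */
    for (; i + 8 <= n; i += 8) {
        acc0 = vfmaq_f32(acc0, vld1q_f32(&source[i]),
                               vld1q_f32(&weights[i]));
        acc1 = vfmaq_f32(acc1, vld1q_f32(&source[i + 4]),
                               vld1q_f32(&weights[i + 4]));
    }

    /* Single horizontal reduction, done once after the loop. */
    output = vaddvq_f32(vaddq_f32(acc0, acc1));
#endif

    /* Scalar tail for leftover elements; on builds without AArch64 NEON
       this loop handles the whole array. */
    for (; i < n; i++) {
        output += source[i] * weights[i];
    }
    return output;
}
```

Because the additions happen in a different order than in the scalar loop, the result can differ from it in the last few bits for general inputs. Calling it as dot_product(source, weights, NUMBER_OF_INPUTS) would replace either loop from the question.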