I'm using an ARM Cortex-A15 board and trying to optimize perfectly working C code with NEON intrinsics.
Compiler: gcc 4.7 on Ubuntu 12.04
Flags: -g -O3 -mcpu=cortex-a15 -mfpu=neon-vfpv4 -ftree-vectorize -DDRA7XX_ARM -DARM_PROC -DSL -funroll-loops -ftree-loop-ivcanon -mfloat-abi=hard
The function I want to vectorize is just a simple load->multiply->store.
Some parameters: input is a pointer to an array of size 40680. After the loop completes, the pointer should keep its current position so that the same operation can continue on the next part of the input stream through the same pointer.
float32_t A=0.7;
float32_t *ptr_op=(float*)output[9216];
float32x2_t reg1;
for (i = 0; i < 4608; i += 4) {
    /* output[(2*i)] = A*(*input);      // C version
    input++;
    output[(2*i)+1] = A*(*input);
    input++; */
    reg1 = vld1q_f32(input++);          // Neon version
    R_N = vmulq_n_f32(reg1, A);
    vst1q_f32(ptr_op++, R_N);
}
I want to understand where I am making a mistake in this loop, because it seems pretty straightforward.
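For comparison, here is a minimal sketch of how such a load->multiply->store loop might be written so that the register type matches the q-form intrinsics (vld1q_f32 and vst1q_f32 work on float32x4_t, not float32x2_t) and both pointers advance by four floats per iteration rather than one. The function name scale_block, its parameters, and the scalar fallback for non-NEON builds are assumptions made for illustration. They are not taken from the code above, and the element count is left to the caller because the intended loop bound is not clear from the snippet.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the original code):
 * writes out[k] = a * in[k] for k in [0, n) and returns the input
 * position just past the consumed block, so the caller can continue
 * from there on the next call. n is assumed to be a multiple of 4. */
static const float *scale_block(const float *in, float *out, size_t n, float a)
{
    assert(n % 4 == 0);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (size_t i = 0; i < n; i += 4) {
        float32x4_t v = vld1q_f32(in + i);  /* load 4 floats */
        v = vmulq_n_f32(v, a);              /* multiply each lane by a */
        vst1q_f32(out + i, v);              /* store 4 floats */
    }
#else
    /* Scalar fallback so the sketch also builds without NEON. */
    for (size_t i = 0; i < n; ++i)
        out[i] = a * in[i];
#endif
    return in + n;
}
```

A caller would keep the returned pointer, for example input = scale_block(input, ptr_op, count, A); where count is whatever block length is intended, and would advance ptr_op by the same count.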
Here is my inline assembly implementation of the same load->multiply->store loop. Am I going in the right direction?
__asm__ __volatile__(
"\t mov r4, #0\n"
"\t vdup.32 d1,%3\n"
"Lloop2:\n"
"\t cmp r4, %2\n"
"\t bge Lend2\n"
"\t vld1.32 d0, [%0]!\n"
"\t vmul.f32 d0, d0, d1\n"
"\t vst1.32 d0, [%1]!\n"
"\t add r4, r4, #2\n"
"\t b Lloop2\n"
"Lend2:\n"
: "=r"(input), "=r"(ptr_op), "=r"(length), "=r"(A)
: "0"(input), "1"(ptr_op), "2"(length), "3"(A)
: "cc", "r4", "d1", "d0");
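For comparison, a hedged sketch of how the inline assembly could be laid out with read-write ("+r") operands for the two pointers and the counter, a "memory" clobber, braces around the register list, and a numeric local label in place of the global Lloop2/Lend2 names. The helper name scale_asm, its signature, and the choice to pass the multiplier as a raw 32-bit pattern in a core register are assumptions for illustration. The asm path is only compiled for 32-bit ARM with NEON, and any other target uses a plain C loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from the original code):
 * writes out[k] = a * in[k] for k in [0, n) and returns the advanced
 * input pointer. n is assumed to be positive and even, because each
 * iteration handles one d register (2 floats). */
static const float *scale_asm(const float *in, float *out, int n, float a)
{
    assert(n > 0 && n % 2 == 0);
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    uint32_t bits;
    memcpy(&bits, &a, sizeof bits);           /* raw bit pattern of a */
    __asm__ __volatile__(
        "vdup.32   d1, %[bits]        \n\t"   /* d1 = {a, a} */
        "1:                           \n\t"
        "vld1.32   {d0}, [%[in]]!     \n\t"   /* load 2 floats, in += 8 bytes */
        "vmul.f32  d0, d0, d1         \n\t"
        "vst1.32   {d0}, [%[out]]!    \n\t"   /* store 2 floats, out += 8 bytes */
        "subs      %[n], %[n], #2     \n\t"   /* 2 elements done */
        "bgt       1b                 \n\t"
        : [in] "+r"(in), [out] "+r"(out), [n] "+r"(n)
        : [bits] "r"(bits)
        : "cc", "memory", "d0", "d1");
    return in;                                /* already advanced by the asm */
#else
    /* Plain C fallback for targets without 32-bit ARM NEON. */
    for (int k = 0; k < n; ++k)
        out[k] = a * in[k];
    return in + n;
#endif
}
```

In this layout the compiler picks the counter register itself, so nothing like r4 has to be listed as clobbered, and the "+r" constraints tell it that the pointers and the counter are both read and modified by the asm.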