1 vote

I was wondering if you could help me use NEON intrinsics to optimize this mask function. I already tried auto-vectorization with GCC's O3 flag, but the function performed worse than when compiled with O2, which turns auto-vectorization off. For some reason the assembly code produced with O3 is about 1.5 times longer than the code produced with O2.

#include <stdint.h>

void mask(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
  unsigned int ixy = xsize * ysize;  // total number of elements
  while (ixy--)
    *(s++) &= *(m++);                // AND each element of s with the matching element of m
}

I believe I need the following intrinsics:

vld1q_u32 // to load 4 integers from s and m

vandq_u32 // to compute the bitwise AND of the 4 integers from s and m

vst1q_u32 // to store them back into s

However, I don't know how to do this in the most optimal way. For instance, should I increment s and m by 4 after loading, ANDing, and storing? I am quite new to NEON, so I would really appreciate some help.
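
To make the question concrete, here is my rough first attempt with those three intrinsics (I'm not sure the pointer handling is right, and it assumes xsize * ysize is a multiple of 4):

#include <arm_neon.h>
#include <stdint.h>

void mask_neon(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
  unsigned int ixy = (xsize * ysize) / 4;  // 4 elements per NEON vector
  uint32x4_t vs, vm;
  while (ixy--)
  {
    vs = vld1q_u32(s);                 // load 4 integers from s
    vm = vld1q_u32(m);                 // load 4 integers from m
    vst1q_u32(s, vandq_u32(vs, vm));   // AND them and store back into s
    s += 4;                            // advance both pointers by 4 elements
    m += 4;
  }
}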

I am using gcc 4.8.1 and I am compiling with the following command:

arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -O3 -fprefetch-loop-arrays name.c -o name

Thanks in advance

I can help you with this advice: turn off auto-vectorization with -fno-tree-vectorize. And STAY AWAY from intrinsics unless you want to spend more time debugging than coding. Go for assembly if you need NEON for your purposes. – Jake 'Alquimista' LEE

Thanks for the response. So you suggest that writing the function in assembly is more efficient than using intrinsics? I thought intrinsics map to specific assembly instructions and thus writing them was very similar to writing assembly. What kind of problems are caused by intrinsics? – Nick

Since Linaro took over GCC, it has gotten much better than before, when intrinsics-generated code was simply crap. Now you might get decent performance with intrinsics on simple examples. However, when it comes to real field usage where lots of registers are required, especially when they are permuted, intrinsics do lots of obscure things, like transferring data between registers unnecessarily. – Jake 'Alquimista' LEE

2 Answers

2 votes

I would probably do it like this. I've included 4x loop unrolling. Preloading the cache is always a good idea and can speed things up another 25%. Since there's not much processing going on (the time is mostly spent loading and storing), it's best to load lots of registers and then process them, as that gives the data time to actually arrive. The code assumes the element count is an even multiple of 16.

#include <arm_neon.h>
#include <stdint.h>

void fmask(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
  unsigned int ixy;
  uint32x4_t srcA, srcB, srcC, srcD;
  uint32x4_t maskA, maskB, maskC, maskD;

  ixy = xsize * ysize;
  ixy /= 16;                      // process 16 elements per iteration
  while (ixy--)
  {
    __builtin_prefetch(&s[64]);   // preload the cache
    __builtin_prefetch(&m[64]);
    srcA  = vld1q_u32(&s[0]);     // load four vectors each from s and m
    maskA = vld1q_u32(&m[0]);
    srcB  = vld1q_u32(&s[4]);
    maskB = vld1q_u32(&m[4]);
    srcC  = vld1q_u32(&s[8]);
    maskC = vld1q_u32(&m[8]);
    srcD  = vld1q_u32(&s[12]);
    maskD = vld1q_u32(&m[12]);
    srcA = vandq_u32(srcA, maskA);  // AND all four vector pairs
    srcB = vandq_u32(srcB, maskB);
    srcC = vandq_u32(srcC, maskC);
    srcD = vandq_u32(srcD, maskD);
    vst1q_u32(&s[0],  srcA);        // store the results back into s
    vst1q_u32(&s[4],  srcB);
    vst1q_u32(&s[8],  srcC);
    vst1q_u32(&s[12], srcD);
    s += 16;
    m += 16;
  }
}
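
If the element count isn't guaranteed to be a multiple of 16, one way to handle the leftovers (my own addition, not part of the routine above) is a scalar tail after the vector loop:

  // hypothetical tail appended to fmask:
  ixy = (xsize * ysize) & 15;   // 0..15 elements remain
  while (ixy--)
    *(s++) &= *(m++);           // finish them one at a time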
0 votes

I would start with the simplest version and take it as a reference for comparison with future routines.

A good rule of thumb is to calculate needed things as soon as possible, not exactly when they are needed. Instructions can take X cycles to execute, but their results are not always immediately ready, so scheduling is important.

As an example, a simple scheduling schema for your case would be (pseudocode):

nn = n/4          // assuming n is a multiple of 4

LOADI_S(0)        // load block 0 of s, then increment the pointer
LOADI_M(0)        // load block 0 of m, then increment the pointer
for (k = 1; k < nn; k++) {
   AND_SM(k-1)    // inner op: AND the previously loaded blocks
   LOADI_S(k)     // load the next block of s, increment after
   LOADI_M(k)     // load the next block of m, increment after
   STORE_S(k-1)   // store the ANDed block, increment after
}
AND_SM(nn-1)
STORE_S(nn-1)     // final store; no need to increment

By hoisting those instructions out of the inner loop, the operations inside no longer depend on the result of the immediately preceding operation. This schema can be extended further to exploit the time that would otherwise be lost waiting for the result of the previous op.
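
To make the schema concrete, here is one possible realization with NEON intrinsics (a sketch of my own, assuming n is a nonzero multiple of 4):

#include <arm_neon.h>
#include <stdint.h>

void mask_pipelined(uint32_t *s, uint32_t *m, unsigned int n)
{
  unsigned int k, nn = n / 4;           // number of 4-element blocks
  uint32x4_t vs, vm, vr;

  vs = vld1q_u32(s);                    // LOADI_S(0)
  vm = vld1q_u32(m);                    // LOADI_M(0)
  s += 4; m += 4;
  for (k = 1; k < nn; k++) {
    vr = vandq_u32(vs, vm);             // AND_SM(k-1)
    vs = vld1q_u32(s);                  // LOADI_S(k)
    vm = vld1q_u32(m);                  // LOADI_M(k)
    vst1q_u32(s - 4, vr);               // STORE_S(k-1)
    s += 4; m += 4;
  }
  vst1q_u32(s - 4, vandq_u32(vs, vm));  // AND_SM(nn-1) and STORE_S(nn-1)
}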

Also, as intrinsics still depend on the optimizer, check what the compiler does under different optimization options. I prefer to use inline assembly, which is not difficult for small routines and gives you more control.
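
For illustration, a minimal inline-assembly version could look like this (again a sketch of my own, assuming n is a nonzero multiple of 4; the post-increment addressing does the pointer bookkeeping for free):

#include <stdint.h>

void mask_asm(uint32_t *s, uint32_t *m, unsigned int n)
{
  asm volatile(
    "1:                           \n\t"
    "  vld1.32 {d0,d1}, [%[s]]    \n\t"  // load 4 words from s
    "  vld1.32 {d2,d3}, [%[m]]!   \n\t"  // load 4 words from m, post-increment m
    "  vand    q0, q0, q1         \n\t"  // q0 &= q1
    "  vst1.32 {d0,d1}, [%[s]]!   \n\t"  // store result, post-increment s
    "  subs    %[n], %[n], #4     \n\t"  // 4 elements done
    "  bgt     1b                 \n\t"
    : [s] "+r"(s), [m] "+r"(m), [n] "+r"(n)
    :
    : "d0", "d1", "d2", "d3", "cc", "memory");
}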