1 vote

I was wondering if you could help me use NEON intrinsics to optimize this mask function. I already tried auto-vectorization with GCC's O3 flag, but the function performed worse than when compiled with O2, which turns auto-vectorization off. For some reason the assembly code produced with O3 is about 1.5 times longer than the code produced with O2.

#include <stdint.h>

void mask(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
  unsigned int ixy = xsize * ysize;  // total number of elements
  while (ixy--)
    *(s++) &= *(m++);                // AND each element of s with the matching element of m
}

I believe I need the following intrinsics:

vld1q_u32 // to load 4 integers from s and m

vandq_u32 // to compute the bitwise AND of the 4 integers from s and m

vst1q_u32 // to store them back into s

However, I don't know how to do this in the most optimal way. For instance, should I increment s and m by 4 after loading, ANDing, and storing? I am quite new to NEON, so I would really appreciate some help.
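
To make the question concrete, here is my rough first attempt with those three intrinsics (I'm not sure the pointer handling is right, and it assumes xsize * ysize is a multiple of 4):

#include <arm_neon.h>
#include <stdint.h>

void mask_neon(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
  unsigned int ixy = (xsize * ysize) / 4;  // 4 elements per NEON vector
  uint32x4_t vs, vm;
  while (ixy--)
  {
    vs = vld1q_u32(s);                 // load 4 integers from s
    vm = vld1q_u32(m);                 // load 4 integers from m
    vst1q_u32(s, vandq_u32(vs, vm));   // AND them and store back into s
    s += 4;                            // advance both pointers by 4 elements
    m += 4;
  }
}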

I am using gcc 4.8.1 and I am compiling with the following command:

arm-linux-gnueabihf-gcc -mthumb -march=armv7-a -mtune=cortex-a9 -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon -O3 -fprefetch-loop-arrays name.c -o name

Thanks in advance

I can help you with this advice: turn off auto-vectorization with -fno-tree-vectorize. And STAY AWAY from intrinsics unless you want to spend more time debugging than coding. Go for assembly if you need NEON for your purposes. – Jake 'Alquimista' LEE

Thanks for the response. So you suggest that writing the function in assembly is more efficient than using intrinsics? I thought intrinsics map to specific assembly instructions and thus writing them was very similar to writing assembly. What kind of problems are caused by intrinsics? – Nick

Since Linaro took over GCC, it has gotten much better than before, when intrinsics-generated code was simply crap. Now you might get decent performance with intrinsics on simple examples. However, when it comes to real field usage where lots of registers are required, especially when they are permuted, intrinsics do lots of obscure things, like transferring data between registers unnecessarily. – Jake 'Alquimista' LEE

2 Answers

2 votes

I would probably do it like this. I've included 4x loop unrolling. Preloading the cache is always a good idea and can speed things up another 25%. Since there's not much processing going on (the time is mostly spent loading and storing), it's best to load lots of registers and then process them, as that gives the data time to actually arrive. The code assumes the element count is an even multiple of 16.

#include <arm_neon.h>
#include <stdint.h>

void fmask(unsigned int xsize, unsigned int ysize, uint32_t *s, uint32_t *m)
{
  unsigned int ixy;
  uint32x4_t srcA, srcB, srcC, srcD;
  uint32x4_t maskA, maskB, maskC, maskD;

  ixy = xsize * ysize;
  ixy /= 16;                      // process 16 elements per iteration
  while (ixy--)
  {
    __builtin_prefetch(&s[64]);   // preload the cache
    __builtin_prefetch(&m[64]);
    srcA  = vld1q_u32(&s[0]);     // load four vectors each from s and m
    maskA = vld1q_u32(&m[0]);
    srcB  = vld1q_u32(&s[4]);
    maskB = vld1q_u32(&m[4]);
    srcC  = vld1q_u32(&s[8]);
    maskC = vld1q_u32(&m[8]);
    srcD  = vld1q_u32(&s[12]);
    maskD = vld1q_u32(&m[12]);
    srcA = vandq_u32(srcA, maskA);  // AND all four vector pairs
    srcB = vandq_u32(srcB, maskB);
    srcC = vandq_u32(srcC, maskC);
    srcD = vandq_u32(srcD, maskD);
    vst1q_u32(&s[0],  srcA);        // store the results back into s
    vst1q_u32(&s[4],  srcB);
    vst1q_u32(&s[8],  srcC);
    vst1q_u32(&s[12], srcD);
    s += 16;
    m += 16;
  }
}
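
If the element count isn't guaranteed to be a multiple of 16, one way to handle the leftovers (my own addition, not part of the routine above) is a scalar tail after the vector loop:

  // hypothetical tail appended to fmask:
  ixy = (xsize * ysize) & 15;   // 0..15 elements remain
  while (ixy--)
    *(s++) &= *(m++);           // finish them one at a time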
0 votes

I would start with the simplest version and take it as a reference for comparison with future routines.

A good rule of thumb is to calculate needed things as soon as possible, not exactly when they are needed. Instructions can take X cycles to execute, but their results are not always immediately ready, so scheduling is important.

As an example, a simple scheduling schema for your case would be (pseudocode):

nn = n/4          // assuming n is a multiple of 4

LOADI_S(0)        // load block 0 of s, then increment the pointer
LOADI_M(0)        // load block 0 of m, then increment the pointer
for (k = 1; k < nn; k++) {
   AND_SM(k-1)    // inner op: AND the previously loaded blocks
   LOADI_S(k)     // load the next block of s, increment after
   LOADI_M(k)     // load the next block of m, increment after
   STORE_S(k-1)   // store the ANDed block, increment after
}
AND_SM(nn-1)
STORE_S(nn-1)     // final store; no need to increment

By hoisting those instructions out of the inner loop, the operations inside no longer depend on the result of the immediately preceding operation. This schema can be extended further to exploit the time that would otherwise be lost waiting for the result of the previous op.
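
To make the schema concrete, here is one possible realization with NEON intrinsics (a sketch of my own, assuming n is a nonzero multiple of 4):

#include <arm_neon.h>
#include <stdint.h>

void mask_pipelined(uint32_t *s, uint32_t *m, unsigned int n)
{
  unsigned int k, nn = n / 4;           // number of 4-element blocks
  uint32x4_t vs, vm, vr;

  vs = vld1q_u32(s);                    // LOADI_S(0)
  vm = vld1q_u32(m);                    // LOADI_M(0)
  s += 4; m += 4;
  for (k = 1; k < nn; k++) {
    vr = vandq_u32(vs, vm);             // AND_SM(k-1)
    vs = vld1q_u32(s);                  // LOADI_S(k)
    vm = vld1q_u32(m);                  // LOADI_M(k)
    vst1q_u32(s - 4, vr);               // STORE_S(k-1)
    s += 4; m += 4;
  }
  vst1q_u32(s - 4, vandq_u32(vs, vm));  // AND_SM(nn-1) and STORE_S(nn-1)
}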

Also, as intrinsics still depend on the optimizer, check what the compiler does under different optimization options. I prefer to use inline assembly, which is not difficult for small routines and gives you more control.
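
For illustration, a minimal inline-assembly version could look like this (again a sketch of my own, assuming n is a nonzero multiple of 4; the post-increment addressing does the pointer bookkeeping for free):

#include <stdint.h>

void mask_asm(uint32_t *s, uint32_t *m, unsigned int n)
{
  asm volatile(
    "1:                           \n\t"
    "  vld1.32 {d0,d1}, [%[s]]    \n\t"  // load 4 words from s
    "  vld1.32 {d2,d3}, [%[m]]!   \n\t"  // load 4 words from m, post-increment m
    "  vand    q0, q0, q1         \n\t"  // q0 &= q1
    "  vst1.32 {d0,d1}, [%[s]]!   \n\t"  // store result, post-increment s
    "  subs    %[n], %[n], #4     \n\t"  // 4 elements done
    "  bgt     1b                 \n\t"
    : [s] "+r"(s), [m] "+r"(m), [n] "+r"(n)
    :
    : "d0", "d1", "d2", "d3", "cc", "memory");
}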