0
votes

I'm a little stuck! I would like to optimize the following code with ARM NEON but I'm not sure how to do it.

uint8_t* srcPtr = src->get();
uint8_t* dstPtr = dst->get();

int i;
for(i=0; i< SIZE; i++){
   *dstPtr++ = srcPtr[0];
   *dstPtr++ = srcPtr[1];
   *dstPtr++ = srcPtr[0];
   *dstPtr++ = srcPtr[1];
   *dstPtr++ = srcPtr[0];
   *dstPtr++ = srcPtr[1];

   srcPtr+= 2;
}

For example, if srcPtr (as uint8_t values) contains

0 1 2 3

then dstPtr would contain

0 1 0 1 0 1 2 3 2 3 2 3

Can someone please help me?

1 Answer

4
votes

Since you want to copy pairs of bytes, the easiest thing to do is to treat each pair as a 16-bit value. Endianness doesn't matter as long as you load and store with the same element type. Cast the pointers to void* rather than uint16_t*, so that the compiler has no reason to add alignment hints: if you cast a pointer to uint16_t*, Clang will assume it is an aligned pointer and may add unsafe hints in some cases.

Since each pair is repeated three times, the easiest way to achieve that is vst3, which interleaves three registers as it stores them. If the factor were 4 or 8 you could use vdup instead, but that doesn't work for threes.

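The loop body should look something like this:

uint16x4_t v = vld1_u16((void *)srcPtr);  // load 4 byte pairs as 4 x 16-bit lanes
uint16x4x3_t v3 = { v, v, v };            // three copies of the same register
vst3_u16((void *)dstPtr, v3);             // interleaved store: p0 p0 p0 p1 p1 p1 ...
srcPtr += 8;                              // 8 source bytes consumed
dstPtr += 24;                             // 24 destination bytes written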
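
For completeness, here is a minimal sketch of how that loop body might be wrapped into a whole function, written as C. The function name triple_pairs and its pairs parameter (the SIZE from the question) are my own, hypothetical names. The sketch assumes dst has room for 6 * pairs bytes, handles pair counts that aren't a multiple of 4 with a scalar tail, and falls back to the scalar loop entirely when NEON isn't available at compile time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (name and signature are illustrative, not from the
 * question): writes each 2-byte pair of src to dst three times in a row.
 * pairs is the number of 2-byte pairs; dst must hold 6 * pairs bytes. */
static void triple_pairs(const uint8_t *src, uint8_t *dst, size_t pairs)
{
    size_t i = 0;

    assert(src != NULL && dst != NULL);

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Vector path: 4 pairs (8 bytes in, 24 bytes out) per iteration. */
    for (; i + 4 <= pairs; i += 4) {
        /* Going through void * avoids an explicit uint16_t * cast. */
        uint16x4_t v = vld1_u16((const void *)src);
        uint16x4x3_t v3 = {{ v, v, v }};
        /* vst3 interleaves the three registers lane by lane, so each
         * 16-bit lane (one byte pair) ends up stored three times. */
        vst3_u16((void *)dst, v3);
        src += 8;
        dst += 24;
    }
#endif

    /* Scalar tail; this is also the whole loop when NEON is unavailable. */
    for (; i < pairs; i++) {
        for (int r = 0; r < 3; r++) {
            *dst++ = src[0];
            *dst++ = src[1];
        }
        src += 2;
    }
}
```

With the question's example, calling triple_pairs on the 2 pairs 0 1 2 3 takes only the scalar tail and produces 0 1 0 1 0 1 2 3 2 3 2 3; the vector path starts being used once there are at least 4 pairs.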