Optimizing a NEON XOR implementation

4

votes

Trying to xor a huge uint32 array I decided to use NEON coprocessor.

I implemented two c versions:

version 1:

uint32_t xor_array_ver_1(uint32_t *array, int size)
{
    uint32x2_t acc = vmov_n_u32(0);
    uint32_t acc1 = 0;
    for (; size != 0; size -= 2) {
        uint32x2_t vec;
        vec = vld1_u32(array);
        array += 2;
        acc = veor_u32(acc, vec);
    }
    acc1 = vget_lane_u32(acc,0) ^ vget_lane_u32(acc,1);
    return acc1;
}

version 2:

uint32_t xor_array_ver_2(uint32_t *array, int size)
{
    uint32x4_t acc = vmovq_n_u32(0);
    uint32_t acc1 = 0;

    for (; size != 0; size -= 4) {
        uint32x4_t vec;
        vec = vld1q_u32(array);
        array += 4;
        acc = veorq_u32(acc, vec);
    }

    acc1 ^= vgetq_lane_u32(acc,0);
    acc1 ^= vgetq_lane_u32(acc,1);
    acc1 ^= vgetq_lane_u32(acc,2);
    acc1 ^= vgetq_lane_u32(acc,3);

    return acc1;
}

Comparing the above 2 versions to the traditional xor implementation:

for (i=0; i<arr_size; i++)
        val ^= my_array[i];

I observed 2 issues:

Version 1 has the same performance.
Version 2 is s bit more than 30% better.

Can I rewrite it to be even better? where my_array is declared as uint32_t my_array[BIG_LENGTH];
Is there a non-NEON way I can improve the performance of the regular xoring code? unrolling the loop doesn't give any improvement.

c optimizationarmneoncpu-cache

Have you tried increasing the alignment of the data? For instance, making it a union with an array of uint32x4_t??? – technosaurus

@technosaurus yes it's aligned. The assumption is that the data is aligned. – 0x90

Aligned to int isn't the same. If you align data to 128bit instead of 32bit, you can cut down on cache misses and use aligned loads. If you're only aligned to int, then you may end up doing an unaligned load across cache lines... a double performance penalty. – technosaurus

@technosaurus it's aligned to PAGE_SIZE. Your comment is true in general though. – 0x90

5

votes

Most likely this will be memory bandwidth limited - once you saturate the available DRAM bandwidth, which should be quite easy to do with only one ALU operation per load, you won't get any further benefit from optimisation.

Try to combine your XOR with another operation on the same data if possible - that way you amortise the cost of the cache misses.

2

votes

A lengthly answer without any code snippets.

Hardware limits

First you should ask yourself what do I expect? Do you want to write the fastest code possible? How can you verify that? Start with for example writing some tests on what your hardware can achieve. As people pointed this will be mostly memory bandwidth limited, but then you need to know how fast your memory interface is. Figure out your platform's L1, L2 and ram capacity / performance characteristics, then you'll know what you can expect at most for different buffer sizes.

Compiler

Are you using latest compiler? Following question then is, are you using tools available to you at their best? Most of the compilers do not aggressively try to optimize your code, unless you told so. Are you configuring them for your best gain? Are you enabling full optimization (gcc: -O3), vectorization (gcc: -ftree-vectorize -ftree-vectorizer-verbose=1)? Do you set right configuration flags for your platform (-mcpu -mfpu)?

Are you verifying object code generated by compiler? For such a simple loop this would be very easy and help you try many configuration options and check the code produced.

Tweaks

Are you checking if using restricted pointers improves the performance?

What about alignment information? (For example you don't mention in your intrinsics examples but they expect size to be a multiply of 2 or 4 and of course that with usage of quad registers can create %30 improvement.)

What also about trying alignment on cache line size?

Hardware capabilities

Do you know what your hardware is capable of? For example Cortex-A9 is introduced as "Out-of-order speculative issue superscalar". Can you take advantage of having dual issue capabilities?

So the answer is somewhere between "it depends" and "you need to experiment".

2

votes

It's well known fact that neon intrinsics on gcc suck badly. Not sure if it was improved, but doing the same task in asm should give you way better improvement that 30% over plain c. You probably need to unroll the inner loop first of all. An easy way of transforming intrinsics to proper asm is to use armcc (compiler from arm) that works with intrinsics.

So, first try to unroll your plain c version (pseudo code):

for (i=arr_size; i<arr_size; i -= 4)
{
    val1 ^= my_array[0];
    val2 ^= my_array[1];
    val1 ^= my_array[2];
    val2 ^= my_array[3];
    my_array += 4;
}

doing something like that with neon should give you better results. Eventually, you should switch to neon asm, it's quite simple (Personally, I find it easier to write than the intrinsics).

Here's NEON asm suggestion (It's untested, up to you to figure out how to assemble it)

//data has to be suitably aligned (it has to be 8 or 16 byte aligned, not sure).
//dataSize in bytes has to be multiple of 64 and has to be at least 128.
//function does xor of uint32_t values and returns the result.
unsigned xor_array_64(const void *data, int dataSize);

xor_array_64:
      vldm r0!,{d0-d7}
      subs r1,r1,#0x40
0:
      pld [r0, #0xC0]
      vldm r0!,{d16-d23}
      veor q0, q0, q8
      veor q1, q1, q9
      veor q2, q2, q10
      veor q3, q3, q11
      subs r1,r1,#0x40
      bge 0b

      veor q0, q0, q1
      veor q2, q2, q3
      veor q0, q0, q2
      veor d0, d0, d1

      vtrn.32 d1, d0
      veor d0, d0, d1

      vmov r0, s0
      bx lr

1

votes

I don't write for ARM, and I'm not familiar with NEON at all, but I had the following thought, which is dependent on ARM NEON being a pipelined architecture, which I don't know if it is....

If Paul R is correct about your memory bandwidth being saturated, this may have little if any benefit, but what if you slightly restructured your code as follows.....

uint32_t xor_array_ver_2(uint32_t *array, int size)
{
  // Caveat:  'size' must be a positive multiple of 4, otherwise this
  //          code will loop for a very long time... and almost certainly
  //          segfault (or whatever term your system uses).

  uint32x4_t acc = vmovq_n_u32(0);
  uint32x4_t next_vec = vld1q_u32(array);
  uint32_t acc1 = 0;

  for (size-=4, array+=4; size != 0; size-=4) {
     uint32x4_t vec = next_vec;
     array += 4;
     next_vec = vld1q_u32(array);
     acc = veorq_u32(acc, vec);
  }
  acc = veorq_u32(acc, next_vec);

  acc1 ^= vgetq_lane_u32(acc,0);
  acc1 ^= vgetq_lane_u32(acc,1);
  acc1 ^= vgetq_lane_u32(acc,2);
  acc1 ^= vgetq_lane_u32(acc,3);

  return acc1;
}

....with the goal of getting started on the load of the next vector element before it's needed for the following loop.

Another slight twist you might try is this:

uint32_t xor_array_ver_2(uint32_t *array, int size)
{
  // Caveat:  'size' must be a positive multiple of 4, otherwise this
  //          code will loop for a very long time... and almost certainly
  //          segfault (or whatever term your system uses).

  uint32x4_t acc = vmovq_n_u32(0);
  uint32x4_t next_vec = vld1q_u32(&array[size-4]);
  uint32_t acc1 = 0;

  for (size-=8; size>=0; size-=4) {
     uint32x4_t vec = next_vec;
     next_vec = vld1q_u32(&array[size]);
     acc = veorq_u32(acc, vec);
  }
  acc = veorq_u32(acc, next_vec);

  acc1 ^= vgetq_lane_u32(acc,0);
  acc1 ^= vgetq_lane_u32(acc,1);
  acc1 ^= vgetq_lane_u32(acc,2);
  acc1 ^= vgetq_lane_u32(acc,3);

  return acc1;
}

Optimizing a NEON XOR implementation

4 Answers

Hardware limits

Compiler

Tweaks

Hardware capabilities