0
votes

I was trying to make my older code run faster after discovering that the RPi 2 processor supports NEON instructions. So I wrote this code:

__asm__ __volatile__(
  "vld1.8 {%%d2, %%d3}, [%1];"
  "vld1.8 {%%d4, %%d5}, [%2];"
  "vaba.u8 %%q0, %%q1, %%q2;"
  "vst1.64 %%d0, [%0];"
  : "=r" (address_sad_intermediary)
  : "r" (address_big_pic), "r" (address_small_pic)
  :
);

Then, in C, sad_intermediary is added to the main sad variable.

The main goal is to compute the sum of absolute differences (SAD), so I load 16 B from big_pic into the q1 register and 16 B from small_pic into the q2 register, calculate the SAD into q0, then store the lower 8 B of q0 into the intermediary variable. The problem is that the resulting sad is zero.

I use GCC 4.9.2 with -std=c99 -pthread -O3 -lm -Wall -march=armv7-a -mfpu=neon-vfpv4 -mfloat-abi=hard options.

Do you see any problems with the code? Thanks.


1 Answer

1
votes

You never load anything into q0, so the vaba is adding the absolute difference to an uninitialised register. It also looks like you're not declaring which registers you're modifying (the clobber list after the last colon is empty).

But I don't know if that's the cause of your problem because I'm not too handy with inline assembly. You probably shouldn't be using inline assembly for something like this, though. If you use intrinsics then the compiler has greater ability to optimise the code. Something like this:

#include <arm_neon.h>

...
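/* load the current eight per-lane accumulators */
uint8x8_t s = vld1_u8(address_sad_intermediary);
/* s[i] += |big_pic[i] - small_pic[i]| for each of the eight lanes */
s = vaba_u8(s, vld1_u8(address_big_pic), vld1_u8(address_small_pic));
/* write the eight accumulators back */
vst1_u8(address_sad_intermediary, s);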

(Note that this code handles only eight bytes at a time, because your code stores only eight bytes.)
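If what you want in the end is a single SAD value for all 16 bytes, something along these lines might be closer. This is only a sketch under a few assumptions: the names sad16 and sad16_scalar are made up for illustration, both pointers are assumed to reference at least 16 readable bytes, and the absolute differences are widened to 16-bit lanes because vaba_u8 accumulates in 8-bit lanes, which can wrap around. The scalar version is there as a reference and as a fallback when NEON is not enabled.

```c
#include <stdint.h>
#include <stdlib.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Scalar reference: sum of absolute differences over 16 bytes.
   (Hypothetical helper name, not from the original post.) */
uint32_t sad16_scalar(const uint8_t *big, const uint8_t *small)
{
    uint32_t sum = 0;
    for (int i = 0; i < 16; i++)
        sum += (uint32_t)abs((int)big[i] - (int)small[i]);
    return sum;
}

/* Same result, using NEON intrinsics when the compiler enables them. */
uint32_t sad16(const uint8_t *big, const uint8_t *small)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    uint8x16_t a = vld1q_u8(big);    /* 16 bytes from the big picture   */
    uint8x16_t b = vld1q_u8(small);  /* 16 bytes from the small picture */

    /* |a - b| of the low 8 bytes, widened to eight 16-bit lanes */
    uint16x8_t acc = vabdl_u8(vget_low_u8(a), vget_low_u8(b));
    /* add |a - b| of the high 8 bytes; each lane is at most 510 */
    acc = vabal_u8(acc, vget_high_u8(a), vget_high_u8(b));

    /* horizontal sum: pairwise widen 16 -> 32 -> 64 bits, then add both halves */
    uint32x4_t s32 = vpaddlq_u16(acc);
    uint64x2_t s64 = vpaddlq_u32(s32);
    return (uint32_t)(vgetq_lane_u64(s64, 0) + vgetq_lane_u64(s64, 1));
#else
    return sad16_scalar(big, small);
#endif
}
```

You would then call it as sad += sad16(address_big_pic, address_small_pic); and no intermediary buffer is needed.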