
I wrote a 16x4 SAD (sum of absolute differences) function and an ARM NEON optimized version of it, written in inline assembly. My problem is that I am getting only a 2x speedup (with -O3 enabled), while ideally I should be getting at least 6x. Can anyone please explain what is happening internally?

#include <cstdlib>   // for abs()

static unsigned int f_sad_16x4 ( const unsigned char* a, const unsigned char* b, const unsigned int uiStrideOrg, const unsigned int uiStrideCur )
{
  unsigned int sad = 0;
  for (int i = 0; i < 4; i++)
  {
      for (int j = 0; j < 16; j++)
      {
          sad += abs(static_cast<int>(a[i*uiStrideOrg+j]) - static_cast<int>(b[i*uiStrideCur+j]));
      }
  }
  return sad;
}

static unsigned int f_sad_16x4_neon(const unsigned char* a, const unsigned char* b, const unsigned int uiStrideOrg, const unsigned int uiStrideCur )
{
    unsigned short auiSum[8];
    unsigned short* puiSum = auiSum;

    __asm__ volatile(
    /* Load 4 rows of piOrg and piCur each, 16 bytes per row */
    "vld1.8 {q0},[%[piOrg]],%[iStrideOrg]   \n\t"
    "vld1.8 {q4},[%[piCur]],%[iStrideCur]   \n\t"
    "vld1.8 {q1},[%[piOrg]],%[iStrideOrg]   \n\t"
    "vabd.u8 q8,  q0, q4                    \n\t"
    "vld1.8 {q5},[%[piCur]],%[iStrideCur]   \n\t"
    "vld1.8 {q2},[%[piOrg]],%[iStrideOrg]   \n\t"
    "vabd.u8 q9,  q1, q5                    \n\t"
    "vld1.8 {q6},[%[piCur]],%[iStrideCur]   \n\t"
    "vld1.8 {q3},[%[piOrg]],%[iStrideOrg]   \n\t"
    "vabd.u8 q10, q2, q6                    \n\t"
    "vld1.8 {q7},[%[piCur]],%[iStrideCur]   \n\t"
    /* Widen the absolute differences to u16 and sum them pairwise */
    "vpaddl.u8 q12, q8                      \n\t"
    "vabd.u8 q11, q3, q7                    \n\t"
    "vpaddl.u8 q13, q9                      \n\t"
    "vpaddl.u8 q14, q10                     \n\t"
    "vadd.u16 q8, q12, q13                  \n\t"
    "vpaddl.u8 q15, q11                     \n\t"
    "vadd.u16 q9, q14, q15                  \n\t"
    "vadd.u16 q0, q8, q9                    \n\t"
    /* Store the 8 partial u16 sums for the scalar reduction below */
    "vst1.16 {q0}, [%[puiSum]]              \n\t"
    :[piOrg]        "+r"    (a),
     [piCur]        "+r"    (b),
     [puiSum]       "+r"    (puiSum)
    :[iStrideCur]   "r"     (uiStrideCur),
     [iStrideOrg]   "r"     (uiStrideOrg)
    :"q0","q1","q2","q3","q4","q5","q6","q7","q8","q9","q10","q11","q12","q13","q14","q15"
    );

    unsigned int uiSum = auiSum[0] + auiSum[1] + auiSum[2] + auiSum[3] + auiSum[4] + auiSum[5] + auiSum[6] + auiSum[7];

    return uiSum;
}
Doing such a small amount of work in a function can mean that function call overheads swamp any optimisation gains. If you're calling this code in a loop, then consider either (a) refactoring so that the optimised code is in the loop rather than in a separate function, and moving any set-up/tear-down stuff out of the loop, or (b) rewriting the function using intrinsics and making it inline; that way the compiler can get rid of any function preamble/postamble code, reschedule the instructions, and probably even do a better job of allocating registers. – Paul R
Did you check the assembly code that the compiler generates for the first function? Maybe it is using the NEON unit already... – Christoph Freundl
It is not using NEON instructions in the unoptimized version. – user3249055
Here's the generated assembly code: onedrive.live.com/… – user3249055
Nearly half of that code is loads and stores - are you expecting to find a threefold increase in memory bandwidth out of nowhere? ;) – Notlikethat

1 Answer


This code performs poorly because the compiler has to emit 23 integer instructions in addition to the 20 NEON instructions in your inline assembler block.

The simplest part to fix is this line:

unsigned int uiSum = auiSum[0] + auiSum[1] + auiSum[2] + auiSum[3] + auiSum[4] + auiSum[5] + auiSum[6] + auiSum[7];

This final reduction step can be performed on the NEON unit instead, e.g.:

VADDL.S16   q0, d0, d1     // 32 bit lanes in q0
VPADDL.S32  q0, q0         // 64 bit lanes in q0
VADD.I64    d0, d0, d1     // one 64 bit result in d0

You can then retrieve the result with a single move:

VMOV %n, %Hn, d0           // retrieve 64 bit result

In the above, you need to set n to correspond to the appropriate operand for the result variable in the inline asm outputs block.
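
For instance (a hedged sketch, not the exact code: the 64-bit variable ullSum and the operand name sum are illustrative), binding the result as a 64-bit output lets %[sum] and %H[sum] name the low and high core registers for the VMOV, and the VST1/scalar reduction disappears entirely:

unsigned long long ullSum;
__asm__ volatile(
    /* ... the SAD block and the three reduction instructions above ... */
    "vmov %[sum], %H[sum], d0               \n\t" /* low half -> %[sum], high half -> %H[sum] */
    :[sum]          "=r"    (ullSum),
     [piOrg]        "+r"    (a),
     [piCur]        "+r"    (b)
    :[iStrideCur]   "r"     (uiStrideCur),
     [iStrideOrg]   "r"     (uiStrideOrg)
    :"q0","q1","q2","q3","q4","q5","q6","q7","q8","q9","q10","q11","q12","q13","q14","q15"
);
return ullSum;   /* truncates to unsigned int on return */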

The other problem is that the register allocation is suboptimal. The registers d8 to d15 (q4 to q7) are callee-saved: any function that uses them must preserve them, so the compiler emits stack save/restore code around your asm block. You can rewrite your function to reuse registers and avoid q4 to q7 entirely.
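
A hedged sketch of that idea (not the answer's exact code): VABAL accumulates the widened absolute differences directly, so each row needs only two scratch registers and one accumulator, none of them callee-saved:

"vmov.i8  q8, #0                        \n\t" /* clear the u16 accumulator      */
/* repeat the next four instructions for each of the 4 rows */
"vld1.8   {q0},[%[piOrg]],%[iStrideOrg] \n\t"
"vld1.8   {q1},[%[piCur]],%[iStrideCur] \n\t"
"vabal.u8 q8, d0, d2                    \n\t" /* accumulate |org - cur|, low 8  */
"vabal.u8 q8, d1, d3                    \n\t" /* accumulate |org - cur|, high 8 */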

This function would benefit from the use of NEON intrinsics. That would avoid the need to worry about register allocation, and would also make your code portable to AArch64.
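
A minimal sketch of an intrinsics version, assuming <arm_neon.h> and a hypothetical name f_sad_16x4_neon_intr (vabal_u8 maps to the VABAL instruction shown above); it is marked inline so the compiler can remove the call overhead mentioned in the comments:

#include <arm_neon.h>

static inline unsigned int f_sad_16x4_neon_intr(const unsigned char* a, const unsigned char* b, const unsigned int uiStrideOrg, const unsigned int uiStrideCur)
{
    uint16x8_t acc = vdupq_n_u16(0);            // eight u16 partial sums
    for (int i = 0; i < 4; i++)
    {
        uint8x16_t org = vld1q_u8(a);           // 16 pixels of the current row
        uint8x16_t cur = vld1q_u8(b);
        acc = vabal_u8(acc, vget_low_u8(org),  vget_low_u8(cur));   // acc += |org - cur|
        acc = vabal_u8(acc, vget_high_u8(org), vget_high_u8(cur));
        a += uiStrideOrg;
        b += uiStrideCur;
    }
    // Horizontal reduction, kept on the NEON unit as described above
    uint32x4_t sum32 = vpaddlq_u16(acc);
    uint64x2_t sum64 = vpaddlq_u32(sum32);
    return (unsigned int)(vgetq_lane_u64(sum64, 0) + vgetq_lane_u64(sum64, 1));
}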