
Consider the following two pieces of code; the first is the C version:

void __attribute__((noinline)) proj(uint8_t * line, uint16_t length)
{
    uint16_t i;
    int16_t tmp;
    for(i=HPHD_MARGIN; i<length-HPHD_MARGIN; i++) {
        tmp = line[i-3] - 4*line[i-2] + 5*line[i-1] - 5*line[i+1] + 4*line[i+2] - line[i+3];
        hphd_temp[i]=ABS(tmp);
    }
}

The second is the same function (except at the borders) using NEON intrinsics:

void __attribute__((noinline)) proj_neon(uint8_t * line, uint16_t length)
{
    int i;
    uint8x8_t b0b7, b8b15, p1p8,p2p9,p4p11,p5p12,p6p13, m4, m5;
    uint16x8_t result;

    m4 = vdup_n_u8(4);
    m5 = vdup_n_u8(5);
    b0b7 = vld1_u8(line);
    for(i = 0; i < length - 16; i+=8) {
        b8b15 = vld1_u8(line + i + 8);
        p1p8  = vext_u8(b0b7,b8b15, 1);
        p2p9  = vext_u8(b0b7,b8b15, 2);
        p4p11 = vext_u8(b0b7,b8b15, 4);
        p5p12 = vext_u8(b0b7,b8b15, 5);
        p6p13 = vext_u8(b0b7,b8b15, 6);

        result = vsubl_u8(b0b7, p6p13);       // p[-3] - p[3]
        result = vmlal_u8(result, p2p9, m5);  // +5 * p[-1]
        result = vmlal_u8(result, p5p12, m4); // +4 * p[2]
        result = vmlsl_u8(result, p1p8, m4);  // -4 * p[-2]
        result = vmlsl_u8(result, p4p11, m5); // -5 * p[1]
        vst1q_s16(hphd_temp + i + 3, vabsq_s16(vreinterpretq_s16_u16(result)));
        b0b7 = b8b15;
    }
    /* TODO: handle the remaining pixels */

}

I am disappointed by the performance gain: it is only around 10-15%. If I look at the generated assembly:

  • The C version compiles to a loop of 108 instructions.
  • The NEON version compiles to a loop of 72 instructions.

But one iteration of the NEON loop processes 8 times as much data as one iteration of the C loop, so a much more dramatic improvement should be seen.
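
To spell out the expectation (ignoring dual issue, load/store latency and memory effects):

    C loop:    108 instructions for 1 pixel per iteration  -> 108 instructions per pixel
    NEON loop:  72 instructions for 8 pixels per iteration ->   9 instructions per pixel

So the NEON version issues roughly 12x fewer instructions per pixel, which is why a 10-15% gain looks so small.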

Do you have any explanation for the small difference between the two versions?

Additional details: the test data is a 10 Mpix image; computation time is around 2 seconds for the C version.

CPU: ARM Cortex-A8

It is not uncommon to see smaller performance gains than instruction counts would suggest, particularly when dealing with large amounts of data such as images. Despite the 36-instruction difference, it is likely that most of the time is spent waiting for data to move around in memory, and that most of your performance boost comes from the NEON code handling memory better (branch prediction, larger chunks per loop, fewer instructions, etc.) rather than from the number of instructions executed. – ChrisCM
What compiler, compiler version and compiler command line parameters are you using? Could you include the disassembly of the intrinsic version? – unixsmurf
@ChrisCM: This is not about 108 versus 72, but about 108*8 versus 72. Even taking dual issue into account, I would still expect a 6x improvement. – shodanex
I'm slightly surprised that you're not rearranging your terms as tmp = 1*(line[i-3] - line[i+3]) + 4*(line[i+2] - line[i-2]) + 5*(line[i-1] - line[i+1]) and just calculating three differences plus a dot product at the end (it'd be a different gather-load / vector-sub / parallel-multiply sequence). But that'll only help if you're not memory-bandwidth-constrained already. It'd be worth issuing __pld(*(line + 8)) before the loop, and __pld(*(line + i + 16)) inside the loop (a sketch of this appears after these comments). – FrankH.
Can you post the disassembly of your code? It's well known that the GNU compiler sometimes does something crazy with NEON intrinsics. (I smell a rat here.) – Jake 'Alquimista' LEE
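
For reference, here is a rough, untested sketch of what FrankH.'s rearrangement plus prefetching could look like in the intrinsics version. The function name proj_neon2 is hypothetical; it assumes, as in the question, that hphd_temp is a global int16_t buffer, and it uses GCC's __builtin_prefetch in place of __pld:

#include <arm_neon.h>
#include <stdint.h>

extern int16_t hphd_temp[];   /* output buffer assumed global, as in the question */

void __attribute__((noinline)) proj_neon2(uint8_t *line, uint16_t length)
{
    int i;
    uint8x8_t b0b7, b8b15, p1p8, p2p9, p4p11, p5p12, p6p13;
    int16x8_t d1, d2, d3, acc;

    b0b7 = vld1_u8(line);
    __builtin_prefetch(line + 8);              /* in place of __pld before the loop */
    for (i = 0; i < length - 16; i += 8) {
        __builtin_prefetch(line + i + 16);     /* prefetch ahead of the next load */
        b8b15 = vld1_u8(line + i + 8);
        p1p8  = vext_u8(b0b7, b8b15, 1);
        p2p9  = vext_u8(b0b7, b8b15, 2);
        p4p11 = vext_u8(b0b7, b8b15, 4);
        p5p12 = vext_u8(b0b7, b8b15, 5);
        p6p13 = vext_u8(b0b7, b8b15, 6);

        /* tmp = (p[-3] - p[3]) + 4*(p[2] - p[-2]) + 5*(p[-1] - p[1]) */
        d1 = vreinterpretq_s16_u16(vsubl_u8(b0b7,  p6p13));
        d2 = vreinterpretq_s16_u16(vsubl_u8(p5p12, p1p8));
        d3 = vreinterpretq_s16_u16(vsubl_u8(p2p9,  p4p11));
        acc = vmlaq_n_s16(d1, d2, 4);
        acc = vmlaq_n_s16(acc, d3, 5);

        vst1q_s16(hphd_temp + i + 3, vabsq_s16(acc));
        b0b7 = b8b15;
    }
    /* TODO: remaining pixels, as in the original */
}

Whether this helps at all depends on whether the loop is compute-bound or memory-bound; the prefetches only matter in the latter case.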

1 Answer


I'm going to take a wild guess and say that data caching is the reason you don't see the big performance gain you are expecting. While I don't know whether your chipset supports caching, or at what level, if the data spans cache lines, is poorly aligned, or is being processed while the CPU is doing other things at the same time (interrupts, threads, etc.), that could also muddy your results.
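
One quick way to test that hypothesis is to time a pass that only streams the same image through NEON loads, with no filtering and no stores; if it takes nearly as long as proj_neon, the kernel is bound by memory rather than by instruction count. A minimal, hypothetical sketch (the function name and the XOR trick are mine):

#include <arm_neon.h>
#include <stdint.h>

/* Streams 'length' bytes through NEON loads plus a trivial XOR so the
 * compiler cannot remove the loop; no filtering, no stores. */
uint32_t stream_only(const uint8_t *line, uint32_t length)
{
    uint8x8_t acc = vdup_n_u8(0);
    uint32_t i;

    for (i = 0; i + 8 <= length; i += 8)
        acc = veor_u8(acc, vld1_u8(line + i));

    /* fold the 8 lanes into a scalar so the result is actually used */
    uint8_t lanes[8];
    uint32_t sum = 0;
    vst1_u8(lanes, acc);
    for (i = 0; i < 8; i++)
        sum += lanes[i];
    return sum;
}

For comparison, the real kernel also writes 2 bytes per pixel to hphd_temp, so its memory traffic is roughly three times what this read-only pass generates.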