5
votes

Learning about ARM NEON intrinsics, I was timing a function that I wrote to double the elements in an array.The version that used the intrinsics takes more time than a plain C version of the function.

Without NEON :

    void  double_elements(unsigned int *ptr, unsigned int size)
 {
        unsigned int loop;
        for( loop= 0; loop<size; loop++)
                ptr[loop]<<=1;
        return;
 }

With NEON:

 void  double_elements(unsigned int *ptr, unsigned int size)
{    
        unsigned int i;
        uint32x4_t Q0,vector128Output;
        for( i=0;i<(SIZE/4);i++)
        {
                Q0=vld1q_u32(ptr);               
                Q0=vaddq_u32(Q0,Q0);
                vst1q_u32(ptr,Q0);
                ptr+=4;

        }
        return;
}

Wondering if the load/store operations between the array and vector is consuming more time which offsets the benefit of the parallel addition.

UPDATE:More Info in response to Igor's reply.
1.The code is posted here:
plain.c
plain.s
neon.c
neon.s
From the section(label) L7 in both the assembly listings,I see that the neon version has more number of assembly instructions.(hence more time taken?)
2.I compiled using -mfpu=neon on arm-gcc, no other flags or optimizations.For the plain version, no compiler flags at all.
3.That was a typo, SIZE was meant to be size;both are same.
4,5.Tried on an array of 4000 elements. I timed using gettimeofday() before and after the function call.NEON=230us,ordinary=155us.
6.Yes I printed the elements in each case.
7.Did this, no improvement whatsoever.

3
Thanks. Can you try with optimizations (e.g. -O3)? It seems there's a lot of redundant code which might be affecting timings.Igor Skochinsky
Thanks for the reply.The -O3 brought down the time from 230us to 95us.itisravi

3 Answers

3
votes

The question is rather vague and you didn't provide much info but I'll try to give you some pointers.

  1. You won't know for sure what's going on until you look at the assembly. Use -S, Luke!
  2. You didn't specify the compiler settings. Are you using optimizations? Loop unrolling?
  3. First function uses size, second uses SIZE, is this intentional? Are they the same?
  4. What is the size of the array you tried? I don't expect NEON to help at all for a couple of elements.
  5. What is the speed difference? Several percents? Couple of orders of magnitude?
  6. Did you check that the results are the same? Are you sure the code is equivalent?
  7. You're using the same variable for intermediate result. Try storing the result of the addition in another variable, that could help (though I expect the compiler will be smart and allocate a different register). Also, you could try using shift (vshl_n_u32) instead of the addition.

Edit: thanks for the answers. I've looked a bit around and found this discussion, which says (emphasis mine):

Moving data from NEON to ARM registers is Cortex-A8 is expensive, so NEON in Cortex-A8 is best used for large blocks of work with little ARM pipeline interaction.

In your case there's no NEON to ARM conversion but only loads and stores. Still, it seems that the savings in parallel operation are eaten up by the non-NEON parts. I would expect better results in code which does many things while in NEON, e.g. color conversions.

4
votes

Something like this might run a bit faster.

void  double_elements(unsigned int *ptr, unsigned int size)
{    
    unsigned int i;
    uint32x4_t Q0,Q1,Q2,Q3;

    for( i=0;i<(SIZE/16);i++)
    {
            Q0=vld1q_u32(ptr);               
            Q1=vld1q_u32(ptr+4);               
            Q0=vaddq_u32(Q0,Q0);
            Q2=vld1q_u32(ptr+8);               
            Q1=vaddq_u32(Q1,Q1);
            Q3=vld1q_u32(ptr+12);               
            Q2=vaddq_u32(Q2,Q2);
            vst1q_u32(ptr,Q0);
            Q3=vaddq_u32(Q3,Q3);
            vst1q_u32(ptr+4,Q1);
            vst1q_u32(ptr+8,Q2);
            vst1q_u32(ptr+12,Q3);
            ptr+=16;

    }
    return;
}

There are a few problems with the original code (some of those the optimizer may fix but other it may not, you need to verify in the generated code):

  • The result of the add is only available in the N3 stage of the NEON pipeline so the following store will stall.
  • Assuming the compiler is not unrolling the loop there may be some overhead associated with the loop/branch.
  • It doesn't take advantage of the ability to dual issue load/store with another NEON instruction.
  • If the source data isn't in cache then the loads would stall. You can preload the data to speed this up with the __builtin_prefetch intrinsic.
  • Also as others have pointed out the operation is fairly trivial, you'll see more gains for more complex operations.

If you were to write this with inline assembly you could also:

  • Use the aligned load/stores (which I don't think the intrinsics can generate) and ensure your pointer is always 128 bit aligned, e.g. vld1.32 {q0}, [r1 :128]
  • You could also use the postincrement version (which I'm also not sure intrinsics will generate), e.g. vld1.32 {q0}, [r1 :128]!

95us for 4000 elements sounds pretty slow, on a 1GHz processor that's ~95 cycles per 128bit chunk. You should be able to do better assuming you're working from the cache. This figure is about what you'd expect if you're bound by the speed of the external memory.

3
votes

Process in bigger quantities per instruction, and interleave load/stores, and interleave usage. This function currently doubles (shifts left) 56 uint.

void shiftleft56(const unsigned int* input, unsigned int* output)
{
  __asm__ (
  "vldm %0!, {q2-q8}\n\t"
  "vldm %0!, {q9-q15}\n\t"
  "vshl.u32 q0, q2, #1\n\t"
  "vshl.u32 q1, q3, #1\n\t"
  "vshl.u32 q2, q4, #1\n\t"
  "vshl.u32 q3, q5, #1\n\t"
  "vshl.u32 q4, q6, #1\n\t"
  "vshl.u32 q5, q7, #1\n\t"
  "vshl.u32 q6, q8, #1\n\t"
  "vshl.u32 q7, q9, #1\n\t"
  "vstm %1!, {q0-q6}\n\t"
  // "vldm %0!, {q0-q6}\n\t" if you want to overlap...
  "vshl.u32 q8, q10, #1\n\t"
  "vshl.u32 q9, q11, #1\n\t"
  "vshl.u32 q10, q12, #1\n\t"
  "vshl.u32 q11, q13, #1\n\t"
  "vshl.u32 q12, q14, #1\n\t"
  "vshl.u32 q13, q15, #1\n\t"
  // lost cycle here unless you overlap
  "vstm %1!, {q7-q13}\n\t"
  : "=r"(input), "=r"(output) : "0"(input), "1"(output)
  : "q0", "q1", "q2", "q3", "q4", "q5", "q6", "q7",
    "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15", "memory" );
}

What's important to remember for Neon optimization... It has two pipelines, one for load/stores (with a 2 instruction queue - one pending and one running - typically taking 3-9 cycles each), and one for arithmetical operations (with a 2 instruction pipeline, one executing and one saving its results). As long as you keep these two pipelines busy and interleave your instructions, it will work really fast. Even better, if you have ARM instructions, as long as you stay in registers, it will never have to wait for NEON to be done, they will be executed at the same time (up to 8 instructions in cache)! So you can put up some basic loop logic in ARM instructions, and they'll be executed simultaneously.

Your original code also was only using one register value out of 4 (q register have 4 32 bits values). 3 of them were getting a doubling operation for no apparent reason, so you were 4 times as slow as you could've been.

What would be better in this code is to for this loop, process them embedded by adding vldm %0!, {q2-q8} following the vstm %1! ... and so on. You also see I wait 1 more instruction before sending out its results, so the pipes are never waiting for something else. Finally, note the !, it means post-increment. So it reads/writes the value, and then increments the pointer from the register automatically. I suggest you don't use that register in ARM code, so it won't hang its own pipelines... keep your registers separated, have a redundant count variable on ARM side.

Last part ... what I said might be true, but not always. It depends on the current Neon revision you have. Timing might change in the future, or might not have always been like that. It works for me, ymmv.