0 votes

I am trying to understand the possible benefits of compiling C++ code with active neon flags in the gcc compiler. For that I made a little program that iterates through an array and makes simple arithmetic operations.

I changed the code so that anyone can compile and run it. If anyone would be nice enough to perform this test and share results, I'd be much appreciated :)

EDIT: I really ask that someone who happens to have a Cortex-A9 board nearby perform this test and check whether the result is the same. I'd really appreciate that.

#include <ctime>

int main()
{
    unsigned long long arraySize = 30000000;

    unsigned short* arrayShort = new unsigned short[arraySize];

    std::clock_t begin;

    for (unsigned long long n = 0; n < arraySize; n++)
    {
        *arrayShort = rand() % 100 + 1; // random value in [1, 100]
        arrayShort++;
    }

    arrayShort -= arraySize; // rewind the pointer to the start of the array

    begin = std::clock();
    for (unsigned long long n = 0; n < arraySize; n++)
    {
        // the work being timed: add 10 to each element, then divide it by 3
        *arrayShort += 10;
        *arrayShort /= 3;

        arrayShort++;
    }

    std::cout << "Time: " << (std::clock() - begin) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;

    arrayShort -= arraySize; // rewind again so the original pointer is freed
    delete[] arrayShort;

    return 0;
}

Basically, I fill a 30,000,000-element array with random numbers between 1 and 100, and then go through every element to add 10 and divide by 3. I was expecting that compiling this code with the NEON flags active would lead to great improvements, due to NEON's ability to perform the same operation on multiple array elements at a time.

I am compiling this code to run on a Cortex-A9 ARM board, using the Linaro toolchain with GCC 4.8.3. I compiled it with and without the following flags:

-O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon 
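
For reference, a full invocation with the flags looks like this (a sketch only: arm-linux-gnueabihf-g++ is an assumed name for the Linaro cross compiler, so substitute your toolchain's actual prefix and file names):

arm-linux-gnueabihf-g++ -O3 -mcpu=cortex-a9 -ftree-vectorize -mfloat-abi=hard -mfpu=neon main.cpp -o main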

I also replicated the code to run with arrays of type unsigned int, float, and double. These are the results:

Array type        With NEON flags    Without NEON flags
unsigned short    0.07 s             0.089 s
unsigned int      0.524 s            0.529 s
float             0.65 s             0.673 s
double            0.955 s            0.927 s

You can see that, for the most part, there is almost no improvement from using the NEON flags, and they even lead to a worse result in the case of the double array.

I really feel that I'm doing something wrong here; perhaps you can help me interpret these results.

What -O flag are you using? If it's "none" then note that benchmarking unoptimised code is utterly meaningless. I tried compiling to see what the assembly looks like, but being a bit C++-challenged I don't know where to find Timer and RNG. - Notlikethat

-O3, or don't bother. The array of doubles will not benefit from ARMv7 NEON, and even if you shrink it down to float (which can), you need -ffast-math. - unixsmurf

NEON does not support integer division, so there's nothing to vectorize. Try a multiply instead. - Yves Daoust

I've done this sort of test and the latest GCC still doesn't vectorize properly. Microsoft's ARM compiler can do some NEON vectorization. If you want fast ARM/NEON code, write assembly language. Depending on the compiler for optimized performance is rarely the right option (in my experience). - BitBank

Auto-vectorization is utterly useless most of the time, regardless of compiler. - Jake 'Alquimista' LEE

2 Answers

3 votes

I had to fix up your code with:

#include <iostream>
#include <cstdlib>

After which, GCC 5.0 auto-vectorizes your loop like so:

.L7:
    vld1.64 {d16-d17}, [r1:64]  @ load 8 x 16-bit elements into q8
    adds    r4, r4, #1          @ increment 64-bit loop counter (low word)
    vadd.i16    q8, q8, q11     @ add 10 to all 8 lanes (q11 presumably holds eight copies of 10)
    adc r5, r5, #0              @ loop counter, high word
    cmp r3, r5                  @ 64-bit compare against arraySize
    add r1, r1, #16             @ advance the load pointer
    vmull.u16 q9, d16, d20      @ widening multiply by the reciprocal constant
    cmpeq   r2, r4
    vmull.u16 q8, d17, d21      @ for division by 3 (held in d20/d21)
    add lr, lr, #16             @ advance the store pointer
    vuzp.16 q9, q8              @ gather the high 16-bit halves of the 32-bit products
    vshr.u16    q8, q8, #1      @ one more shift: in total, (x * magic) >> 17
    vstr    d16, [lr, #-16]     @ store the 8 results
    vstr    d17, [lr, #-8]
    bhi .L7

So yes, the compiler can auto-vectorize the code, but is it any good? On a Cortex-A7 board I have nearby, I see the following times:

g++ ~/foo.cpp -O3
./a.out 
Time: 129.355 ms

g++ ~/foo.cpp -O3 -fno-tree-vectorize
./a.out 
Time: 430.405 ms

Which is about what you would hope for from a 4x vectorization factor (4x 16-bit values): 430.405 ms / 129.355 ms ≈ 3.3x.

In this case, I think the data and the generated assembly code speak for themselves and refute some of the claims in the comments above. The compiler can, and will, perform auto-vectorization, and the performance you can achieve from it is a meaningful speedup.

Also of note, the compiler has beaten one of the expert assembly programmers from the comments!

"NEON does not support integer division, so there's nothing to vectorize. Try a multiply instead."

True in the general case, yes. But efficient instruction sequences exist to divide by particular constants using NEON, and '3' happens to be one of those constants!
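
The trick visible in the assembly above is the standard divide-by-a-constant-via-reciprocal-multiplication transform. As a scalar sketch of what the vector code does per element (assuming, from the vmull/vuzp/vshr sequence, that d20/d21 hold the 16-bit magic constant 0xAAAB):

#include <cstdint>

// Unsigned division by 3 with no divide instruction:
// multiply by the rounded-up reciprocal 0xAAAB = (2^17 + 1) / 3,
// then shift the 32-bit product right by 17.
// Exact (matches x / 3) for every 16-bit input.
uint16_t div3(uint16_t x)
{
    return static_cast<uint16_t>((static_cast<uint32_t>(x) * 0xAAABu) >> 17);
}

In the vector version, vmull.u16 is the widening multiply, vuzp.16 extracts the high 16-bit halves of the products (a shift by 16), and the trailing vshr.u16 #1 completes the shift by 17.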

My Linaro/Ubuntu GCC 4.8.2 system compiler also vectorizes this code, producing very similar code to the above, with similar timings.
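
If you want to confirm what the vectorizer did on your own toolchain, GCC can report it; on GCC 4.9 and later the option is -fopt-info-vec, while GCC 4.8 has the older -ftree-vectorizer-verbose= instead:

g++ -O3 -fopt-info-vec foo.cpp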

0 votes

I attempted to rewrite this code using the arm_neon.h intrinsics, and the results are very surprising, so much so that I need some help interpreting them.

Here is the code:

#include <ctime>
#include <stdio.h>
#include <cstdlib>
#include <arm_neon.h>

int main()
{
    unsigned long long arraySize = 125000000;

    std::clock_t begin;

    unsigned short* arrayShort = new unsigned short[arraySize];

    for (unsigned long long n = 0; n < arraySize; n++)
    {
        *arrayShort = rand() % 100 + 1; // random value in [1, 100]
        arrayShort++;
    }

    arrayShort -= arraySize; // rewind the pointer to the start of the array

    uint16x8_t vals;
    uint16x8_t constant1 = {10, 10, 10, 10, 10, 10, 10, 10}; // eight lanes of 10
    uint16x8_t constant2 = {3, 3, 3, 3, 3, 3, 3, 3};         // eight lanes of 3

    begin = std::clock();
    for (unsigned long long n = 0; n < arraySize; n += 8)
    {
        vals = vld1q_u16(arrayShort);      // load 8 elements
        vals = vaddq_u16(vals, constant1); // add 10 to each lane
        vals = vmulq_u16(vals, constant2); // multiply each lane by 3

//      std::cout << vals[0] <<  "   " << vals[1] <<  "   " << vals[2] <<  "   " << vals[3] <<  "   " << vals[4] <<  "   " << vals[5] <<  "   " << vals[6] <<  "   " << vals[7] <<  std::endl;

        arrayShort += 8;
    }

    std::cout << "Time: " << (std::clock() - begin) / (double)(CLOCKS_PER_SEC / 1000) << " ms" << std::endl;

    arrayShort -= arraySize;
    delete[] arrayShort;

    return 0;
}

So now I am creating a 125-million-element array of unsigned shorts. Then I go over it 8 elements at a time, add 10 to each element, and then multiply it by 3.

On a Cortex-A9 board, the plain C++ version of this code takes 270 milliseconds to process that array, while this NEON code takes only 20 milliseconds.

Now, my expectations before seeing the results weren't too high, but the best scenario in my head was an 8x time reduction. I cannot explain how this led to a 13.5x reduction in execution time, and I'd appreciate some help interpreting these results.

I have, of course, inspected the output of the math being done (via the commented-out print), and I can assure you the code works and the results are coherent.
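
One detail that may matter when interpreting the numbers: unlike the plain C++ loop, this version never writes vals back to the array. For an apples-to-apples comparison, the loop would need the matching store, sketched here (vst1q_u16 is the arm_neon.h store counterpart of vld1q_u16):

for (unsigned long long n = 0; n < arraySize; n += 8)
{
    vals = vld1q_u16(arrayShort);      // load 8 elements
    vals = vaddq_u16(vals, constant1); // add 10 to each lane
    vals = vmulq_u16(vals, constant2); // multiply each lane by 3
    vst1q_u16(arrayShort, vals);       // write the 8 results back

    arrayShort += 8;
}

Without the store, the timed loop does less memory traffic than the scalar version it is being compared against, which may account for part of the better-than-8x ratio.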